A TED talk has let a lot of people know that many internet users have been helping digitize old books without even noticing. Luis von Ahn, a Guatemalan entrepreneur who contributed to the creation of CAPTCHA, revealed a little of what's behind this web nuisance.
You've probably seen those annoying codes that you need to type every time you fill in a form, just to prove that you're really human and not an evil computer program. Well, every time you type those little letters, you're helping the web digitize old books. "About 900 million people help us digitize books with CAPTCHA, that is, 10% of humanity," Luis says.
This is how it works: OCR programs, which scan a page of text to digitize the words on it, cannot make out letters that look garbled, whether because of page positioning, fading ink, yellowing paper, or simply because the text is in a language that uses accents, cedillas and other characters that leave computers totally confused.
So, when we type a CAPTCHA word, we're teaching computers how to read and learn new words that came from digitized books, thus improving the quality and accuracy of the old books we want to read. Think about it: if you visit Google Books to look for a book from way back in the 1700s (one that is already in the public domain, hasn't been in print for decades, and is currently distributed for free over the internet), CAPTCHA certainly had a hand in it.
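For the curious, here is a rough sketch in Python of how such a system could work. It is only my illustration, not the actual code behind CAPTCHA. It assumes the publicly described setup in which each challenge pairs a word the system already knows with a word the OCR could not read, and a user's answer for the unreadable word only counts if they got the known word right. The names `WordPool`, `submit` and `VOTES_NEEDED`, and the threshold of three matching answers, are all made up for the example.

```python
from collections import Counter

# Hypothetical threshold: how many matching answers settle a word.
VOTES_NEEDED = 3


class WordPool:
    """Collects human transcriptions for words the OCR could not read."""

    def __init__(self):
        self.votes = {}      # word_id -> Counter of normalized answers
        self.resolved = {}   # word_id -> accepted transcription

    def submit(self, control_answer, control_truth, word_id, guess):
        """Return True if the user typed the known (control) word correctly."""
        if control_answer.strip().lower() != control_truth.lower():
            return False  # failed the known word: ignore the other answer
        if word_id not in self.resolved:
            counter = self.votes.setdefault(word_id, Counter())
            counter[guess.strip().lower()] += 1
            answer, count = counter.most_common(1)[0]
            if count >= VOTES_NEEDED:
                # Enough people agree, so accept this reading of the word.
                self.resolved[word_id] = answer
        return True


if __name__ == "__main__":
    pool = WordPool()
    # "scan-17" is a made-up identifier for one unreadable scanned word.
    for guess in ["toaster", "Toaster", "toastcr", "toaster"]:
        pool.submit("invisible", "invisible", "scan-17", guess)
    print(pool.resolved)
```

The point of the known word is that it proves you're human; the point of the unknown word is that your answer, once enough other people agree with it, fills a gap the computer could not fill on its own.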
The presenter also explained the relationship between this interactive activity and the effort to translate the internet with the help of volunteers. An extension of this project is called Duolingo, which will be released in less than a month. I'm actually curious to learn more about how it works, since I'm a professional translator who is deeply concerned with the whole idea that one day computers will be able to translate texts quickly in any language... In the meantime, judging by the machine translation example he shows in his talk, I still have a job. What a disaster!
Another cool thing he shared was the CAPTCHArt movement, in which users take a screen capture (by pressing the PrintScreen key or using a program for that exact purpose) every time they come across a weird combination of the two words that the system selects at random and shows at the footer of a form. In one example, the words were "invisible" and "toaster", and the user provided a cute drawing to illustrate the combo.
Check out this interesting talk, which is available in English, and read the complete story here in Spanish. Also, I'd like to know whether you're still bothered by typing those annoying words at the end of a web form, now that you know what's behind them.