Internet: preserving humankind’s intellectual heritage

In 1969, the ARPANET project managed to send a message from a computer at the University of California, Los Angeles to another computer at the Stanford Research Institute in Menlo Park. Originally developed for military communications, and later extended to higher education centers, the network kept growing and adding connections until it finally became commercially available to the public. Almost fifty years after its birth, the internet is an inseparable part of our lives. It experienced its greatest growth during the 1990s and continues to grow: in the year 2000, 51% of telecommunications occurred through the internet; by 2007, this number had risen to an astonishing 97%.

Every day, about 2.5 million terabytes (the equivalent of filling 28.75 billion iPads) of information are added to the existing 1.1 zettabytes (the equivalent of about 36 thousand years of HD video) on the internet. Ninety percent of the content is less than two years old, and great efforts are being made to upload as much human-generated information as possible to “the cloud”: the intangible space where information can be stored indefinitely.

Computers had to learn to deal with all this information quickly in order to help users complete their tasks. This led to the birth of bots: software that automatically performs simple, repetitive tasks much faster than any human could. While most bots are harmless and even necessary for the internet to work as we know it, some were created with mischief in mind. Some bots, for example, are programmed to bypass the safety measures of ticket sales websites and buy thousands of tickets for a show, which are later resold.

In the year 2000, Guatemalan entrepreneur Luis von Ahn introduced an invention that would forever change internet security. Known as CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), it consists of a slightly distorted image of random characters or words that the user must type into a text input box. If the sequence in the image and the sequence typed by the user match, the user is allowed to complete the attempted action.
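The matching step described above can be sketched in a few lines of Python. This is only an illustration, not the actual CAPTCHA implementation: the function names, the alphabet, the six-character length, and the decision to ignore letter case are all assumptions, and the step that renders the sequence as a distorted image is left out.

```python
import hmac
import secrets
import string

# Assumed alphabet for the challenge: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def new_challenge(length: int = 6) -> str:
    """Return the random sequence that would be drawn as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def verify(expected: str, typed: str) -> bool:
    """Return True when the typed sequence matches the challenge.

    Case and surrounding spaces are ignored (an assumption of this sketch).
    """
    left = expected.upper().encode()
    right = typed.strip().upper().encode()
    # Constant-time comparison, so timing does not reveal partial matches.
    return hmac.compare_digest(left, right)


challenge = new_challenge()
user_may_continue = verify(challenge, challenge.lower())  # a correct answer
```

The server keeps the expected sequence to itself and sends only the image, so a bot would have to read the distorted picture to pass.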

What makes CAPTCHA so special is that it was designed as a test that computers cannot pass, but humans can. It has been called the “Reverse Turing Test” because the original Turing test measures a machine’s ability to exhibit intelligent, human-like behavior (such as structuring coherent sentences or playing a game of chess), whereas CAPTCHA tries to make sure the user is human.

This is how it works:

Some software, known as OCR (Optical Character Recognition), can “read” text that is not stored in text format: for example, a scanned document, which is technically an image. Some bots, including some of the resale bots mentioned above, rely on it. However, OCR tends to be very limited. If the text is not perfectly legible, the computer will not be able to read it, or may even misread it. A human user, on the other hand, can interpret the CAPTCHA image even if it is distorted, and input the right sequence.

OCR limitations help prevent bots from performing actions that are reserved for human users, but they also present another challenge: certain organizations like Google and Amazon use OCR software to digitize books, basically by scanning every page, extracting the text using a computer, and saving the digitized transcription. This project represents an effort to preserve humankind’s intellectual heritage and make it more accessible to the public.

However, OCR software has trouble reading books that are more than 50 years old, because the ink is often too blurred or too faint, or there are spots and lines that prevent a correct reading. These problems mean that roughly 30% of the information is lost.

In 2007, Luis von Ahn launched a second version of CAPTCHA, called ReCAPTCHA, which shows two full English words to the user. One of these words is already known to the computer; the other is unknown, because OCR could not read it. The user inputs both words. If the known word is entered correctly, the user is considered human, and if several humans agree on the same reading of the unrecognized word, that reading is accepted: the OCR software learns from it and the word is digitized.
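The two-word mechanism can be sketched as follows. This is a simplified illustration and not Google's or von Ahn's actual code: the function and variable names are hypothetical, and the threshold of three agreeing users is an assumption chosen only for the example. The word the computer already knows is called the control word here.

```python
from collections import Counter, defaultdict

# Assumed number of matching human readings needed to accept a word.
AGREEMENT_THRESHOLD = 3

# For each unknown word (identified by a hypothetical id), count the readings
# proposed by users who typed the control word correctly.
votes = defaultdict(Counter)


def normalize(word: str) -> str:
    """Ignore case and surrounding spaces when comparing words."""
    return word.strip().lower()


def submit(control_answer, control_typed, unknown_id, unknown_typed):
    """Return (is_human, accepted_reading).

    accepted_reading stays None until enough users agree on the same reading.
    """
    if normalize(control_typed) != normalize(control_answer):
        # The control word is wrong: treat the user as a bot, discard the vote.
        return False, None
    votes[unknown_id][normalize(unknown_typed)] += 1
    reading, count = votes[unknown_id].most_common(1)[0]
    accepted = reading if count >= AGREEMENT_THRESHOLD else None
    return True, accepted
```

In this sketch, three different users who pass the control word and type the same reading of the unknown word are enough for that reading to be accepted; answers from users who fail the control word are never counted.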

In 2009, Google bought the ReCAPTCHA project, and now 200 million words are processed every day. This joint effort by almost 900 million internet users who have solved a CAPTCHA at least once in their lives helps digitize approximately 2 million books a year.

CAPTCHA and ReCAPTCHA, however, have seen a decline in use and acceptance. Most users consider CAPTCHA annoying and unnecessary because they do not know its true function. Answering a CAPTCHA means more than taking an extra 10 seconds to complete an action on the internet and adding a layer of security. It means that human users are playing an important role in the evolution of technology and artificial intelligence: we are teaching machines how a human being thinks and behaves, allowing them, in a future that may or may not be distant, to help us do our jobs faster and, above all, preserve human knowledge.