Google Promotes Open Source OCR Library
"You might wonder," reads a Google corporate blog post this morning, "why Google is interested in [optical character recognition]." Indeed, you might wonder that if you didn't already know that Google has been deeply involved with an on-again/off-again project to produce a digital library of the world's literary material.
Although the future of the project remains up in the air, work continues on one of the technical prerequisites to making such a library possible: a project called Tesseract, begun in 1985 at the University of Nevada at Las Vegas. The school worked with HP to build a reliable OCR system that handles all manner of printed text.
As the World Wide Web started to take root, Tesseract began losing ground, perhaps mainly due to the reorganization of HP from a research company to a consumer products firm. In 2005, Google apparently made a successful case for UNLV to release Tesseract into open source.
With Google contributing some of its resources toward updates and corrections, the company sponsored the release of a new version of Tesseract last month. But software announcements being what they are in the modern era, they sometimes need to be re-announced, which is why Google stepped up its efforts this morning to make developers aware of Tesseract's availability.
What isn't obvious at first glance is that Tesseract is an application of a neural networking library. Specifically, it implements a system called Aspirin/MIGRAINES, developed by long-time neural network simulator engineer Russell Leighton, and licensed for free although not open-sourced.
For years, neural networks have been regarded as among the most effective pattern recognition systems, and have thus been applied to OCR. Because so few people understand what neural networking truly is, many of the applications that use it -- including financial analysis -- don't admit up front to doing so.
For the Aspirin system, Leighton implemented a back-propagation network, which learns to recognize patterns by being shown examples repeatedly, comparing its output with the expected answer, and adjusting its internal weights to reduce the error. MIGRAINES serves as the visualization environment for Aspirin developers.
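To make the idea of back-propagation concrete, here is a minimal sketch in Python. It is not Aspirin's or Tesseract's code: the 3x3 "glyphs", the `SAMPLES` list, the `train` function and every parameter value are assumptions invented for illustration. The network is shown a handful of vertical and horizontal bars over and over, and after each one it nudges its weights in the direction that shrinks the error.

```python
import math
import random

# Hypothetical toy data: 3x3 bitmaps flattened row by row.
# Label 1.0 = vertical bar, label 0.0 = horizontal bar.
SAMPLES = [
    ([1, 0, 0, 1, 0, 0, 1, 0, 0], 1.0),  # vertical bar, left column
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], 1.0),  # vertical bar, center column
    ([1, 1, 1, 0, 0, 0, 0, 0, 0], 0.0),  # horizontal bar, top row
    ([0, 0, 0, 0, 0, 0, 1, 1, 1], 0.0),  # horizontal bar, bottom row
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer network; return (predict, per-epoch losses)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each hidden unit has one weight per input plus a trailing bias.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    # The output unit has one weight per hidden unit plus a trailing bias.
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1]) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o[:-1], h)) + w_o[-1])
        return h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            h, y = forward(x)
            err = y - target
            total += err * err
            # Output delta: the error scaled by the sigmoid's slope.
            d_out = err * y * (1.0 - y)
            # Hidden deltas: the output delta propagated back through w_o.
            d_hid = [d_out * w_o[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Gradient-descent updates for both layers.
            for j in range(hidden):
                w_o[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * d_hid[j] * x[i]
                w_h[j][-1] -= lr * d_hid[j]
            w_o[-1] -= lr * d_out
        losses.append(total / len(samples))

    def predict(x):
        return forward(x)[1]

    return predict, losses


if __name__ == "__main__":
    predict, losses = train(SAMPLES)
    print("mean squared error, first epoch:", losses[0])
    print("mean squared error, last epoch: ", losses[-1])
    for pixels, label in SAMPLES:
        print(label, round(predict(pixels), 3))
```

A real OCR classifier works on far larger inputs and many more classes, but the loop has the same shape: forward pass, measure the error, push the error backward, adjust the weights.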
The dependency of Tesseract on Aspirin may make it difficult for open-source developers to sublicense the products of their work to other developers. Aspirin is not licensed under the usual Apache terms; its terms are stated separately.
This isn't much of a problem for Google, though, which for now is mainly interested in seeing developers help perfect Tesseract for its own purposes. To that end, it has put out a call for OCR engineers to join the company.
For now -- even after 21 years -- the Tesseract project appears plagued by the same problem that has baffled engineers building neural-network OCR systems since the beginning. Judging from comments on SourceForge, even though Tesseract remains the best-performing OCR system ever developed, using UNLV metrics, it still has trouble with diacritical marks such as accents and umlauts.
Typically, once diacriticals become a part of text, they impair the analytical system's ability not only to distinguish accented characters from non-accented ones, but ordinary characters from one another. During much of the 1990s, tests on OCR systems were conducted using English-language text, which is most often umlaut-free.
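One way to see how much of an OCR system's trouble comes from diacriticals alone is to sort its character errors into two piles: those where only the accent was lost or changed, and everything else. The Python sketch below does that using Unicode decomposition from the standard library. The function names and the equal-length restriction are assumptions made to keep the example short; this is not how the UNLV metrics are computed.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks (accents, umlauts) removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_errors(truth, ocr):
    """Count (diacritic-only errors, other errors), character by character.

    Simplifying assumption: both strings have the same number of
    characters after NFC normalization, so no alignment step is needed.
    """
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    if len(truth) != len(ocr):
        raise ValueError("this sketch only compares equal-length strings")
    diacritic_only = 0
    other = 0
    for t, o in zip(truth, ocr):
        if t == o:
            continue
        if strip_diacritics(t) == strip_diacritics(o):
            diacritic_only += 1  # same base letter, different accent
        else:
            other += 1  # a different letter altogether
    return diacritic_only, other


if __name__ == "__main__":
    print(classify_errors("München", "Munchen"))
    print(classify_errors("Köln", "Kolm"))
```

A test set made up of English text would rarely put anything in the first pile, which is why the weakness could go unnoticed for so long.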