Google Promotes Open Source OCR Library

By Scott M. Fulton, III
Published 19 years ago

"You might wonder," reads a Google corporate blog post this morning, "why Google is interested in [optical character recognition]." Indeed, you might wonder that if you didn't already know that Google has been deeply involved with an on-again/off-again project to produce a digital library of the world's literary material.

Although the future of the project remains up in the air, work continues on one of the technical prerequisites to making such a library possible: a project called Tesseract, begun in 1985 at the University of Nevada at Las Vegas. The school worked with HP to construct a reliable OCR system that handles all manner of printed text.

As the World Wide Web started to take root, Tesseract began losing ground, perhaps mainly due to the reorganization of HP from a research company to a consumer products firm. In 2005, Google apparently made a successful case for UNLV to release Tesseract into open source.

With Google contributing some of its resources toward updates and corrections, the company sponsored the release of a new version of Tesseract last month. But software announcements being what they are in the modern era, they sometimes need to be re-announced, which is why Google stepped up its efforts this morning to make developers aware of Tesseract's availability.

What isn't obvious at first glance is that Tesseract is an application of a neural networking library. Specifically, it implements a system called Aspirin/MIGRAINES, developed by long-time neural network simulator engineer Russell Leighton, and licensed for free although not open-sourced.

For years, neural networks have been known to be the most effective pattern recognition systems, and have thus been applied to OCR. Because so few people understand what neural networking truly is, many of the applications that use it -- including financial analysis -- don't admit up front to doing so.

For the Aspirin system, Leighton implemented a back-propagation network, which learns to recognize patterns through repeated presentation of examples, analysis of the results, and trial-and-error correction. MIGRAINES serves as the visualization environment for Aspirin developers.
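To make the idea of back-propagation concrete, here is a minimal sketch in Python. It is not Aspirin's code or Tesseract's; the 3x3 "glyph" bitmaps, the network size, the learning rate, and the function names (`train`, `predict`) are all assumptions chosen for illustration. The network is shown the same few patterns over and over, measures how wrong its guess was, and nudges its weights to reduce that error.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, hidden=3, epochs=2000, rate=0.5, seed=0):
    """Train a one-hidden-layer network by back-propagation.

    Returns (predict, history), where history holds the mean squared
    error measured during each pass over the samples.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight row holds one weight per input, with the bias stored last.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1]) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o[:-1], h)) + w_o[-1])
        return h, y

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Introduction and analysis: present a pattern, compare guess to truth.
            h, y = forward(x)
            err = target - y
            total += err * err
            # Back-propagation: push the output error back to the hidden layer.
            d_o = err * y * (1 - y)
            d_h = [d_o * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Correction: adjust every weight a little in the helpful direction.
            for j in range(hidden):
                w_o[j] += rate * d_o * h[j]
                for i in range(n_in):
                    w_h[j][i] += rate * d_h[j] * x[i]
                w_h[j][-1] += rate * d_h[j]
            w_o[-1] += rate * d_o
        history.append(total / len(samples))

    def predict(x):
        return forward(x)[1]

    return predict, history


# Hypothetical 3x3 bitmaps, flattened row by row: vertical strokes are
# labeled 1, horizontal strokes are labeled 0.
VERTICAL_MID = [0, 1, 0, 0, 1, 0, 0, 1, 0]
VERTICAL_LEFT = [1, 0, 0, 1, 0, 0, 1, 0, 0]
HORIZONTAL_MID = [0, 0, 0, 1, 1, 1, 0, 0, 0]
HORIZONTAL_TOP = [1, 1, 1, 0, 0, 0, 0, 0, 0]

SAMPLES = [
    (VERTICAL_MID, 1.0),
    (VERTICAL_LEFT, 1.0),
    (HORIZONTAL_MID, 0.0),
    (HORIZONTAL_TOP, 0.0),
]

if __name__ == "__main__":
    predict, history = train(SAMPLES)
    print("error after first pass:", history[0])
    print("error after last pass: ", history[-1])
    print("vertical stroke score: ", predict(VERTICAL_MID))
```

A real OCR engine works on far larger images and character sets, but the loop has the same shape: show a pattern, score the guess, correct the weights, repeat.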

The dependency of Tesseract on Aspirin may make it difficult for open-source developers to sublicense the products of their work to other developers. Aspirin is not licensed under the usual Apache terms; its terms are stated separately.

This isn't much of a problem for Google, though, which for now is mainly interested in seeing developers help perfect Tesseract for its own purposes. To that end, it has put out a call for OCR engineers to join the company.

For now -- even after 21 years -- the Tesseract project appears plagued by the same problem that has baffled OCR engineers working with neural networks since the beginning. Judging from comments on SourceForge, even though Tesseract remains the best-performing OCR system ever developed, using UNLV metrics, it still has trouble with diacritical marks such as accents and umlauts.

Typically, once diacriticals become a part of text, they impair the analytical system's ability to distinguish not only accented characters from non-accented ones, but also ordinary characters from one another. During much of the 1990s, tests on OCR systems were conducted using English-language text, which is most often umlaut-free.
