Historical and Multilingual OCR

The Historical and Multilingual OCR project, which emerged from collaborative conversations among the Mellon Foundation, the National Endowment for the Humanities, and the Library of Congress, will serve as a catalyst for further OCR research and help the community advance best practices for digital collections that use OCR data.

Over the last fifteen years, optical character recognition (OCR) technology has turned billions of images of books, newspapers, and other materials into searchable text at Google Books, the Internet Archive, the Library of Congress, Europeana, and elsewhere. The digital abundance now available to students, researchers, and general readers is largely concentrated in materials from the last 100 years, where current systems perform well. Although imperfect even for the most modern documents, currently available OCR systems are far less well suited to older printed texts, in any language, and to manuscripts.

In this report, David A. Smith and Ryan Cordell of Northeastern University’s NULab for Texts, Maps, and Networks survey the current state of OCR for historical documents and recommend concrete steps that researchers, implementers, and funders can take to make progress over the next five to ten years. Advances in artificial intelligence for image recognition, natural language processing, and machine learning will drive significant progress. More importantly, sharing goals, techniques, and data among researchers in computer science, in book and manuscript studies, and in library and information science will open up exciting new problems and allow the community to allocate resources and measure progress.

This report was written with the generous support of The Andrew W. Mellon Foundation. The conclusions are solely the responsibility of the authors.