Improving Machine-Readable Text for Newspapers in Chronicling America

The following is a guest post by Nathan Yarasavage, a Digital Projects Specialist in the Serial and Government Publications Division.

Back in 2005, the National Endowment for the Humanities (NEH) and the Library of Congress launched the National Digital Newspaper Program (NDNP), the program that supports Chronicling America and its keyword-searchable access to what is now over 23 million newspaper pages. The Library is excited to announce a new effort to reprocess some of the newspapers that we digitized in the early years of the program. The goal of this effort is to improve the machine-readable text that powers the keyword search of this rich content.

Machine-readable text is created by a technology called Optical Character Recognition (OCR). Using the Tesseract Open Source OCR Engine and customized post-processing tools, the Library created a new OCR reprocessing workflow specifically for Chronicling America.

A lot has changed in 20 years, and OCR technology is no exception. By taking advantage of these improvements, the Library will provide a higher-quality search experience. Better OCR yields more accurate search results for users and a cleaner full-text index for our servers.
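To illustrate why OCR quality matters for keyword search, here is a minimal sketch in Python. It is not the Library's workflow or search system: the sample text, the page identifier, and the `build_index` and `search` helpers are all hypothetical, and the index is a deliberately simple word-to-page lookup.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, keyword):
    """Return the sorted page ids that contain the exact keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical OCR output for the same page, before and after reprocessing.
noisy_pages = {"page-1": "Tbe rnayor annonnced a ncw bridgc"}
clean_pages = {"page-1": "The mayor announced a new bridge"}

noisy_index = build_index(noisy_pages)
clean_index = build_index(clean_pages)

print(search(noisy_index, "mayor"))  # the misread word "rnayor" never matches
print(search(clean_index, "mayor"))  # the corrected text is found
```

In this toy example, the misrecognized words never match what a researcher would type, so the page is effectively invisible to search until the text is corrected.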

Pictured below is an example of the improvement in just one paragraph of a newspaper that has gone through this new workflow. Before reprocessing, the machine-readable text was mostly unrecognizable. As a result of our new workflow, newspaper articles like this will now appear in search results.

A screenshot showing a newspaper article in Chronicling America before OCR reprocessing.
A screenshot showing a newspaper article in Chronicling America after OCR reprocessing.

Since its implementation, the new workflow has significantly improved access to newspapers added during the early years of the program. However, given the inherent challenges of working with historical newspapers on microfilm, we always remind users that achieving 100% accuracy in an automated OCR process is virtually impossible. The creation of quality machine-readable text from historical materials on microfilm is affected by a variety of factors, many of them out of our control. The original newsprint may have been damaged or may have deteriorated before it was microfilmed. Additionally, poor-quality microfilming practices can lead to subpar images, which interfere with the OCR process. On top of these factors, historical newspapers often have tight columns and tiny text sizes that further complicate the OCR process.

The Library team working on this effort is just getting started. To date, we have reprocessed over 170,000 newspaper pages. You can track the progress at the Improved Machine-Readable Text for Newspapers page on the Chronicling America research guide. Additionally, the improved text is available now for searching in the new interface of Chronicling America. Read more about the migration to the new interface.

A screenshot showing improved searchable text in Chronicling America. Improved searchable text of the New-York Tribune (New York, NY), December 21, 1880, in Chronicling America.

As technologies improve and evolve, we plan to adapt our workflows even further to take advantage of the new opportunities.  In the future, we will provide more details about the technologies and processes we used.  For questions about the new workflow, please contact [email protected].

The Chronicling America historic newspapers online collection is a product of the National Digital Newspaper Program and is jointly sponsored by the Library and the National Endowment for the Humanities. For more guidance on how to use Chronicling America, take a look at Chronicling America: A Guide for Researchers. This guide provides an overview of the collection as well as recommended search topics, search strategies, website features, and frequently asked questions.

Try out a search in the new Chronicling America interface today!
