The Institutional Books Corpus
Institutional Books is
a practice in exploration.
To better understand the corpus and its potential impact, we analyzed the dataset’s coverage across time, topic, and language.
Language Coverage
We conducted text-level language detection on the OCR-extracted text, identifying the presence of 379 unique languages.
The results of our analysis confirm that this collection focuses mainly on Western European languages, particularly English, while offering varying levels of coverage for a long tail of languages.
The table above lists the top 10 languages by total number of detected o200k_base tokens; each of these languages accounts for more than 1B o200k_base tokens in the collection.
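For illustration, the sketch below shows how per-language o200k_base token counts might be tallied with the tiktoken library. The volume records, their `language` labels, and the `text` field are hypothetical stand-ins for the dataset's actual schema, not a description of our pipeline.

```python
from collections import Counter

import tiktoken  # provides the o200k_base encoding

# Hypothetical volume records: a detected language label plus the OCR-extracted text.
volumes = [
    {"language": "eng", "text": "Full OCR-extracted text of an English volume ..."},
    {"language": "deu", "text": "Vollständiger OCR-Text eines deutschen Bandes ..."},
]

encoding = tiktoken.get_encoding("o200k_base")
tokens_per_language = Counter()

for volume in volumes:
    # Count o200k_base tokens and attribute them to the volume's detected language.
    tokens_per_language[volume["language"]] += len(encoding.encode(volume["text"]))

for language, count in tokens_per_language.most_common(10):
    print(f"{language}: {count:,} tokens")
```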
Temporal Coverage
To get a sense of the collection’s temporal coverage, we analyzed each volume’s bibliographical metadata.
Of the 67% of books with a precise publication date, the majority were published in the 19th and 20th centuries.
[Chart: number of volumes by publication date]
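As a rough illustration of this kind of temporal analysis, the sketch below bins volumes by publication decade. The `publication_date` field name and the date format are assumptions made for the example, not the bibliographic metadata's actual layout.

```python
import re
from collections import Counter

# Hypothetical bibliographic records; only some carry a precise publication date.
records = [
    {"title": "Volume A", "publication_date": "1874"},
    {"title": "Volume B", "publication_date": "1923"},
    {"title": "Volume C", "publication_date": None},  # no precise date
]

volumes_per_decade = Counter()
for record in records:
    date = record.get("publication_date")
    if not date:
        continue  # skip volumes without a precise publication date
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", date)  # first plausible 4-digit year
    if match:
        year = int(match.group(1))
        volumes_per_decade[(year // 10) * 10] += 1

for decade, count in sorted(volumes_per_decade.items()):
    print(f"{decade}s: {count} volumes")
```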
Topic Classification
We conducted a series of experiments to classify volumes according to the first level of the Library of Congress Classification Outline.
Language and Literature
Law
Philosophy, Psychology, Religion
Science
Social Science
Agriculture
Auxiliary Sciences of History
Medicine
History of the Americas
Political Science
In our analysis, we found a concentration of volumes on the topics of Language and Literature; Law; Philosophy, Psychology, Religion; and Science. The table above displays the top ten of the twenty topics in total.
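One simple way to assign a first-level topic, sketched below, is to map the leading letter of a Library of Congress call number to its top-level class. This is an illustration only; the call-number field and the partial mapping shown here are assumptions, not the classification method described in our report.

```python
# Partial map from the leading letter of an LCC call number to its first-level class.
LCC_TOP_LEVEL = {
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "E": "History of the Americas",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
}

def first_level_topic(call_number: str) -> str:
    """Return the first-level LCC class for a call number, e.g. 'PR4034 .P7'."""
    letter = call_number.strip()[:1].upper()
    return LCC_TOP_LEVEL.get(letter, "Unknown")

print(first_level_topic("PR4034 .P7"))  # Language and Literature
print(first_level_topic("QH365 .O2"))   # Science
```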
Institutional Books is
a practice in refinement.
OPEN SOURCED JUNE 12, 2025
Our Pipeline
Our pipeline comprises experiments aimed at retrieving, analyzing, and refining the source material to make the resulting dataset easier to filter, read, and use for humans and machines alike.
Retrieval
Analysis
Refinement
Retrieval of Source Material
Retrieving over one million books stored on Google Books' servers required writing a custom retrieval pipeline, which we intend to release as open-source software following further refinement. For each volume, we sought to retrieve a .tar.gz file containing scan images, OCR data, and bibliographic and processing-related metadata.
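As a minimal sketch of handling one such archive, the example below unpacks a per-volume .tar.gz and loads any JSON metadata it contains. The member file names and the JSON convention are hypothetical; the actual archive layout may differ.

```python
import json
import tarfile
from pathlib import Path

def unpack_volume(archive_path: str, out_dir: str) -> dict:
    """Extract a single volume's .tar.gz archive and load its metadata, if present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    metadata = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out)  # scan images, OCR data, and metadata land in out_dir
        for member in tar.getnames():
            # Hypothetical convention: a JSON file holds bibliographic and processing metadata.
            if member.endswith(".json"):
                metadata = json.loads((out / member).read_text(encoding="utf-8"))
    return metadata

# Usage: meta = unpack_volume("volume_0001.tar.gz", "volumes/0001")
```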
Analysis of Source Material
To enable effective use of the dataset, we analyzed the temporal, language, and topic coverage of the collection. The image below shows how text-level language detection identified both French and Latin in a book previously cataloged as Latin.
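A minimal sketch of that kind of mixed-language detection is shown below, using fastText's pretrained lid.176 language-identification model over fixed-size chunks of a volume's text. The chunking scheme is an assumption for illustration; our pipeline's actual detector and segmentation may differ.

```python
import fasttext  # language identification via the pretrained lid.176 model

# lid.176.bin (downloaded separately from fastText) covers 176 languages, including Latin.
model = fasttext.load_model("lid.176.bin")

def languages_in_volume(text: str, chunk_chars: int = 2000) -> set[str]:
    """Split a volume's OCR text into chunks and return the set of detected languages."""
    languages = set()
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars].replace("\n", " ").strip()
        if not chunk:
            continue
        labels, _scores = model.predict(chunk)
        languages.add(labels[0].replace("__label__", ""))  # e.g. 'fr', 'la'
    return languages

# A volume cataloged as Latin may turn out to contain both French and Latin passages:
# languages_in_volume(volume_text) -> {'fr', 'la'}
```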
Refinement of OCR-Extracted Text
While the quality of the OCR-extracted text is satisfactory at the character or word level, we observed semantic and positional decontextualization resulting from exporting OCR data as plain text.
As a first step towards improving the usability of the OCR-extracted text, we developed a post-processing pipeline that reassembled the OCR-extracted text using the detected type of each line as a signal.
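The sketch below illustrates the general idea of line-type-driven reassembly: drop running headers, footers, and page numbers, re-join hyphenated line breaks, and merge body lines into paragraphs. The line-type labels and merging rules shown are hypothetical and stand in for whatever line classifier and logic the pipeline actually uses.

```python
def reassemble(lines: list[dict]) -> str:
    """Rebuild readable text from OCR lines tagged with a detected type.

    Each line is a dict like {"text": "...", "type": "body" | "header" | "footer" | "page_number"}.
    """
    paragraphs: list[str] = []
    current: list[str] = []
    for line in lines:
        if line["type"] in {"header", "footer", "page_number"}:
            continue  # drop running headers, footers, and page numbers
        text = line["text"].strip()
        if not text:
            # A blank body line closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Re-join words hyphenated across line breaks.
            current[-1] = current[-1][:-1] + text
        else:
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```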
Learn more about the pipeline in our report.
Institutional Books is
a practice in community.
With the release of this dataset, we seek to establish a community-led process to grow, improve, and use data in ways that strengthen the knowledge ecosystem and the underlying data itself. We envision an institutional commons, supported by community and collaborative research, which incorporates improvements from the AI and research communities for collective benefit. We welcome collaboration from researchers, model makers, and technologists in the following research areas:
Evals & Benchmarks
We see opportunities for Institutional Books to improve model outputs along the axes of long context, multilingual capabilities, and more.
We welcome model makers and AI labs interested in co-developing benchmarks and evaluating the impact of Institutional Books on their models.
Data Refinement and OCR
Our goal is for Institutional Books to best represent the original source material. We invite continued refinement of the OCR-extracted plain text as well as initiatives to re-OCR the dataset and export it as structured text. We believe this process holds potential for developing better OCR pipelines for library use.
Institutional Books is
a practice in stewardship.
IDI partners with libraries to surface collections for the public interest.
The Institutional Data Initiative is built on the belief that libraries have the expertise and the data, in the form of their collections, to influence AI’s trajectory towards the public interest. As AI is poised to change how people access knowledge, this is a powerful point of leverage that libraries can use to assert their leadership and engender collaboration in the development of beneficial AI.
We work with libraries to release structured and refined collections around which AI development and research can transparently unfold. Our goal is to expand the diversity of information, languages, and cultures represented in current models while making information more accessible for the patrons libraries serve.
Martha Whitehead
Harvard University, University Librarian
"As stewards of the public domain and curators of diverse, trustworthy collections, we have the foundational materials needed to train inclusive AI systems.
Through initiatives like IDI, we aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all."



