The Institutional Books Corpus
Institutional Books is
a practice in exploration.
To better understand the corpus and its potential impact, we analyzed the dataset’s coverage across time, topic, and language.
Language Coverage
We conducted text-level language detection on the OCR-extracted text, identifying the presence of 379 unique languages.
The results of our analysis confirm that this collection focuses mainly on Western European languages, particularly English, while offering varying levels of coverage for a long tail of languages.
The table above lists the top 10 languages by total number of detected o200k_base tokens; each of these languages accounts for more than 1B o200k_base tokens in the collection.
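For illustration, the sketch below shows how per-language o200k_base token counts might be tallied with the tiktoken library. The volume records, their `language` labels, and the `text` field are hypothetical stand-ins for the dataset's actual schema, not a description of our pipeline.

```python
from collections import Counter

import tiktoken  # provides the o200k_base encoding

# Hypothetical volume records: a detected language label plus the OCR-extracted text.
volumes = [
    {"language": "eng", "text": "Full OCR-extracted text of an English volume ..."},
    {"language": "deu", "text": "Vollständiger OCR-Text eines deutschen Bandes ..."},
]

encoding = tiktoken.get_encoding("o200k_base")
tokens_per_language = Counter()

for volume in volumes:
    # Count o200k_base tokens and attribute them to the volume's detected language.
    tokens_per_language[volume["language"]] += len(encoding.encode(volume["text"]))

for language, count in tokens_per_language.most_common(10):
    print(f"{language}: {count:,} tokens")
```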
Temporal Coverage
To get a sense of the collection’s temporal coverage, we analyzed each volume’s bibliographical metadata.
Of the 67% of books with a precise publication date, the majority were published in the 19th and 20th centuries.
[Chart: number of volumes by publication date]
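As a rough illustration of this kind of temporal analysis, the sketch below bins volumes by publication decade. The `publication_date` field name and the date format are assumptions made for the example, not the bibliographic metadata's actual layout.

```python
import re
from collections import Counter

# Hypothetical bibliographic records; only some carry a precise publication date.
records = [
    {"title": "Volume A", "publication_date": "1874"},
    {"title": "Volume B", "publication_date": "1923"},
    {"title": "Volume C", "publication_date": None},  # no precise date
]

volumes_per_decade = Counter()
for record in records:
    date = record.get("publication_date")
    if not date:
        continue  # skip volumes without a precise publication date
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", date)  # first plausible 4-digit year
    if match:
        year = int(match.group(1))
        volumes_per_decade[(year // 10) * 10] += 1

for decade, count in sorted(volumes_per_decade.items()):
    print(f"{decade}s: {count} volumes")
```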
Topic Classification
We conducted a series of experiments to classify volumes according to the first level of the Library of Congress Classification Outline.
Language and Literature
Law
Philosophy, Psychology, Religion
Science
Social Science
Agriculture
Auxiliary Sciences of History
Medicine
History of the Americas
Political Science
In our analysis, we found a concentration of volumes on the topics of Language and Literature; Law; Philosophy, Psychology, Religion; and Science. The table above displays the top ten of the twenty topics in total.
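One simple way to assign a first-level topic, sketched below, is to map the leading letter of a Library of Congress call number to its top-level class. This is an illustration only; the call-number field and the partial mapping shown here are assumptions, not the classification method described in our report.

```python
# Partial map from the leading letter of an LCC call number to its first-level class.
LCC_TOP_LEVEL = {
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "E": "History of the Americas",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
}

def first_level_topic(call_number: str) -> str:
    """Return the first-level LCC class for a call number, e.g. 'PR4034 .P7'."""
    letter = call_number.strip()[:1].upper()
    return LCC_TOP_LEVEL.get(letter, "Unknown")

print(first_level_topic("PR4034 .P7"))  # Language and Literature
print(first_level_topic("QH365 .O2"))   # Science
```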
Institutional Books is
a practice in refinement.
OPEN SOURCED JUNE 12, 2025
Our Pipeline
Our pipeline comprises experiments aimed at retrieving, analyzing, and refining the source material to make the resulting dataset easier to filter, read, and use for humans and machines alike.
Retrieval
Analysis
Refinement
Retrieval of Source Material
Retrieving over one million books stored on Google Books' servers required writing a custom retrieval pipeline, which we intend to release as open-source software following further refinement. For each volume, we sought to retrieve a .tar.gz file containing scan images, OCR data, and bibliographic and processing-related metadata.
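As a minimal sketch of handling one such archive, the example below unpacks a per-volume .tar.gz and loads any JSON metadata it contains. The member file names and the JSON convention are hypothetical; the actual archive layout may differ.

```python
import json
import tarfile
from pathlib import Path

def unpack_volume(archive_path: str, out_dir: str) -> dict:
    """Extract a single volume's .tar.gz archive and load its metadata, if present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    metadata = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out)  # scan images, OCR data, and metadata land in out_dir
        for member in tar.getnames():
            # Hypothetical convention: a JSON file holds bibliographic and processing metadata.
            if member.endswith(".json"):
                metadata = json.loads((out / member).read_text(encoding="utf-8"))
    return metadata

# Usage: meta = unpack_volume("volume_0001.tar.gz", "volumes/0001")
```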
Analysis of Source Material
To enable effective use of the dataset, we analyzed the temporal, language, and topic coverage of the collection. The image below shows how text-level language detection identified both French and Latin in a book previously cataloged as Latin.
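A minimal sketch of that kind of mixed-language detection is shown below, using fastText's pretrained lid.176 language-identification model over fixed-size chunks of a volume's text. The chunking scheme is an assumption for illustration; our pipeline's actual detector and segmentation may differ.

```python
import fasttext  # language identification via the pretrained lid.176 model

# lid.176.bin (downloaded separately from fastText) covers 176 languages, including Latin.
model = fasttext.load_model("lid.176.bin")

def languages_in_volume(text: str, chunk_chars: int = 2000) -> set[str]:
    """Split a volume's OCR text into chunks and return the set of detected languages."""
    languages = set()
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars].replace("\n", " ").strip()
        if not chunk:
            continue
        labels, _scores = model.predict(chunk)
        languages.add(labels[0].replace("__label__", ""))  # e.g. 'fr', 'la'
    return languages

# A volume cataloged as Latin may turn out to contain both French and Latin passages:
# languages_in_volume(volume_text) -> {'fr', 'la'}
```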
Refinement of OCR-Extracted Text
While the quality of the OCR-extracted text is satisfactory at the character or word level, we observed semantic and positional decontextualization resulting from exporting OCR data as plain text.
As a first step towards improving the usability of the OCR-extracted text, we developed a post-processing pipeline that reassembled the OCR-extracted text using the detected type of each line as a signal.
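The sketch below illustrates the general idea of line-type-driven reassembly: drop running headers, footers, and page numbers, re-join hyphenated line breaks, and merge body lines into paragraphs. The line-type labels and merging rules shown are hypothetical and stand in for whatever line classifier and logic the pipeline actually uses.

```python
def reassemble(lines: list[dict]) -> str:
    """Rebuild readable text from OCR lines tagged with a detected type.

    Each line is a dict like {"text": "...", "type": "body" | "header" | "footer" | "page_number"}.
    """
    paragraphs: list[str] = []
    current: list[str] = []
    for line in lines:
        if line["type"] in {"header", "footer", "page_number"}:
            continue  # drop running headers, footers, and page numbers
        text = line["text"].strip()
        if not text:
            # A blank body line closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Re-join words hyphenated across line breaks.
            current[-1] = current[-1][:-1] + text
        else:
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```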
Learn more about the pipeline in our report.
Institutional Books is
a practice in community.
With the release of this dataset, we seek to establish a community-led process to grow, improve, and use data in ways that strengthen the knowledge ecosystem and the underlying data itself. We envision an institutional commons, supported by community and collaborative research, which incorporates improvements from the AI and research communities for collective benefit. We welcome collaboration from researchers, model makers, and technologists in the following research areas:
Evals & Benchmarks
We see opportunities for Institutional Books to improve model outputs along the axes of long context, multilingual capabilities, and more.
We welcome model makers and AI labs interested in co-developing benchmarks and evaluating the impact of Institutional Books on their models.
Data Refinement and OCR
Our goal is for Institutional Books to best represent the original source material. We invite continued refinement of the OCR-extracted plain text as well as initiatives to re-OCR the dataset and export it as structured text. We believe this process holds potential for developing better OCR pipelines for library use.
Institutional Books is
a practice in stewardship.
IDI partners with libraries to surface collections for the public interest.
The Institutional Data Initiative is built on the belief that libraries have the expertise and the data, in the form of their collections, to influence AI’s trajectory towards the public interest. As AI is poised to change how people access knowledge, this is a powerful point of leverage that libraries can use to assert their leadership and engender collaboration in the development of beneficial AI.
We work with libraries to release structured and refined collections around which AI development and research can transparently unfold. Our goal is to expand the diversity of information, languages, and cultures represented in current models while making information more accessible for the patrons libraries serve.
Martha Whitehead
Harvard University, University Librarian
"As stewards of the public domain and curators of diverse, trustworthy collections, we have the foundational materials needed to train inclusive AI systems.
Through initiatives like IDI, we aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all."



