The Common Pile


This repository tracks the code used to collect, process, and prepare the datasets for the Common Pile. The code used for the preparation of each source in the Common Pile can be found in the sources/ subdirectory. Source-agnostic utility code and scripts are provided in the common_pile package. If you are looking for the data itself or our trained models, please see our Hugging Face organization.

The majority of packages required for dataset creation can be installed with pip install -r requirements.txt. To make use of the shared functionality in the common_pile package, run pip install -e .. If you are on a system that doesn't support automatic installation of pandoc via pypandoc_binary, change it to pypandoc in requirements.txt and install pandoc manually.

If you'd like to contribute a new source to the Common Pile, please open an issue to share details of the source. Generally, we expect each source to include code that 1) downloads the data, 2) processes it appropriately to retain primarily plain text, and 3) writes out the results in the Dolma format (gzipped jsonl). You can find utilities to help with each of these steps in the common_pile library. Alternatively, you can look at our existing sources for ideas on how to prepare a source. We use git pre-commit hooks to format code and keep style consistent. You can install the pre-commit library with pip install pre-commit and install the pre-commit hooks with pre-commit install from the repository root.
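The final step, writing Dolma-formatted output, can be sketched with the standard library alone. This is a minimal illustration, not the common_pile utility itself; the field names follow the Dolma convention (id, text, source, added, metadata), and the document ids and source name here are made up for the example:

```python
import gzip
import json
from datetime import datetime, timezone

def write_dolma(documents, path):
    """Write (text, metadata) pairs to a gzipped JSONL file in Dolma style.

    Each line is one JSON record carrying an id, the plain text, a source
    name, a timestamp, and source-specific metadata. Sketch only; the real
    common_pile helpers may differ.
    """
    now = datetime.now(timezone.utc).isoformat()
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for i, (text, meta) in enumerate(documents):
            record = {
                "id": f"example-{i}",        # hypothetical id scheme
                "text": text,
                "source": "example-source",  # hypothetical source name
                "added": now,
                "metadata": meta,
            }
            f.write(json.dumps(record) + "\n")

# Two toy documents already reduced to plain text.
write_dolma(
    [("Hello world", {"url": "https://example.com"}), ("Second doc", {})],
    "example.jsonl.gz",
)
```

One record per line keeps the files streamable, so downstream tools can process them without loading a whole source into memory.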

The scripts subdirectory has various scripts that can be helpful for inspecting or computing statistics over data. Alternatively, the Dolma-formatted files can be inspected with jq by running

cat ${file}.jsonl.gz | gunzip | jq -s ${command}
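For quick checks without jq, the same files can also be inspected from Python using only the standard library. The filename and statistics below are illustrative, not part of the scripts subdirectory:

```python
import gzip
import json

def dolma_stats(path):
    """Count documents and total text length in a gzipped Dolma JSONL file."""
    n_docs = 0
    n_chars = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            n_docs += 1
            n_chars += len(record.get("text", ""))
    return n_docs, n_chars

# Build a tiny example file, then inspect it.
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as f:
    for text in ["abc", "defgh"]:
        f.write(json.dumps({"id": text, "text": text}) + "\n")

docs, chars = dolma_stats("sample.jsonl.gz")  # docs=2, chars=8
```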