Picomap: Easy Datasets for Machine Learning


picomap makes it easy to store and load datasets for machine learning. It is tiny (<200 LOC) but works well whenever I have a non-standard dataset and want efficient loading.


[Image: picomap_vs_others]

Actual photo of modern dataset solutions vs picomap.

  • Fast — writes arrays directly to disk in binary form (sketched below)
  • Reproducible — per-item hashing for content verification
  • Simple — one Python file; the only dependencies are numpy, xxhash, and tqdm (tbh you probably don't need the last two)
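
For a sense of how small the write path can be, here is a rough sketch of the idea: stream each array's raw bytes into one .dat file while recording cumulative row offsets. This is not picomap's actual code, and the JSON keys ("dtype", "shape") are hypothetical:

import json
import numpy as np

def build_map_sketch(arrays, path):
    # Stream arrays to disk; track cumulative row offsets as we go.
    starts, total = [0], 0
    dtype, trailing = None, None
    with open(path + ".dat", "wb") as f:
        for arr in arrays:
            dtype, trailing = arr.dtype, arr.shape[1:]      # fixed across items
            f.write(np.ascontiguousarray(arr).tobytes())    # raw binary, no framing
            total += arr.shape[0]
            starts.append(total)                            # offset of item i+1
    np.save(path + ".starts.npy", np.asarray(starts))
    with open(path + ".json", "w") as f:
        json.dump({"dtype": str(dtype), "shape": list(trailing)}, f)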



import numpy as np
import picomap as pm

# Build a ragged dataset from a generator of arrays
lens = np.random.randint(16, 302, size=(101,))
arrs = [np.random.randn(l, 4, 16, 3) for l in lens]
pm.build_map(arrs, "toy")
assert pm.verify_hash("toy")

# Load individual items on demand
load, N = pm.get_loader_fn("toy")
for i in range(N):
    assert np.allclose(arrs[i], load(i))

This writes three files:

toy.dat         # raw binary data
toy.starts.npy  # index offsets
toy.json        # metadata + hash
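
That layout is all random access needs: the starts file holds cumulative row offsets, and each item is a zero-copy slice of the memory-mapped .dat file. A minimal sketch of how such a loader could work, assuming the JSON stores a "dtype" and trailing "shape" (hypothetical keys) and that starts has N+1 entries:

import json
import numpy as np

def make_loader(path):
    # Hypothetical reimplementation of get_loader_fn's core idea.
    with open(path + ".json") as f:
        meta = json.load(f)
    starts = np.load(path + ".starts.npy")       # N+1 cumulative row offsets
    data = np.memmap(path + ".dat", dtype=meta["dtype"], mode="r")
    data = data.reshape(-1, *meta["shape"])      # (total_rows, *trailing_dims)

    def load(i):
        # Slicing a memmap is lazy: bytes are paged in only when touched.
        return data[starts[i]:starts[i + 1]]

    return load, len(starts) - 1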

Function                          Purpose
--------                          -------
build_map(gen, path)              Stream arrays → build dataset on disk
verify_hash(path)                 Recompute & validate hash
get_loader_fn(path)               Return (loader_fn, count) for random access
update_hash_with_array(h, arr)    Internal helper (streamed hashing)
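
The last helper's name hints at how the hash stays cheap: each array's bytes are folded into one running xxhash digest as the dataset is written, so verification never needs the whole dataset in memory. A sketch of that idea (the real function's details may differ):

import numpy as np
import xxhash

def update_hash_with_array(h, arr):
    # Fold the array's raw bytes into the running digest.
    # ascontiguousarray guards against non-contiguous inputs.
    h.update(np.ascontiguousarray(arr).tobytes())

h = xxhash.xxh64()
for arr in (np.zeros((3, 2)), np.ones((5, 2))):
    update_hash_with_array(h, arr)
print(h.hexdigest())  # stable across runs for the same data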

  • All arrays must share the same dtype and trailing dimensions.
  • The first dimension can be ragged across the dataset (i.e., you can have sequences with shapes (*, d1, d2, ..., dn)).
  • Use load(i, copy=True) to materialize a writable copy if you need to modify an item.
  • You can safely share .dat files between processes (read-only); see the sketch below.
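
Read-only sharing works because every process maps the same pages: the OS page cache is shared, so N workers don't mean N copies of the data in RAM. A rough sketch using only the standard library, assuming the "toy" dataset from above is already on disk:

from multiprocessing import Pool
import picomap as pm

def _init():
    # Each worker opens its own read-only loader once, at startup.
    global _load
    _load, _ = pm.get_loader_fn("toy")

def item_mean(i):
    return float(_load(i).mean())

if __name__ == "__main__":
    _, N = pm.get_loader_fn("toy")
    with Pool(processes=4, initializer=_init) as pool:
        means = pool.map(item_mean, range(N))
    print(f"processed {len(means)} items")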

💾 picomap — simple, safe, hash-verified memory-mapped datasets.
No giant databases or fancy formats. Just NumPy and peace of mind.
