Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Across these five high-quality open datasets, embedding-based k-means clustering promotes diversity in the selected samples, while square-root re-balancing prevents any single category from dominating.
This project provides diverse, representative subsets from large-scale training corpora across multiple domains using embedding-based k-means clustering rather than random sampling:
- Scales: 50k, 100k, 250k, 500k, and 1M samples
- Methodology: Deterministic k-means clustering on embeddings (Snowflake Arctic-embed-xs) with 100 iterations
- Balancing: Square-root transformation for imbalanced datasets to prevent category dominance
stratified-kmeans-diverse-pretraining-100K-1M
Combines FineWeb-Edu (educational web content) and Proof-Pile-2 (mathematical/scientific documents):
- FineWeb-Edu: 6 CommonCrawl snapshots from 2025 (99M rows filtered)
- Proof-Pile-2: the algebraic-stack, arxiv, and open-web-math components
stratified-kmeans-diverse-instruction-following-100K-1M
Combines Tulu-3 SFT Mixture and Orca AgentInstruct:
- Tulu-3 SFT Mixture: supervised finetuning data from the state-of-the-art Tulu 3 post-training recipe (939K samples)
- Orca AgentInstruct: Agentic multi-step reasoning tasks (~1M samples)
stratified-kmeans-diverse-reasoning-100K-1M
Stratified subset of the Llama-Nemotron Post-Training Dataset with square-root rebalancing:
- Original: 80.52% STEM-dominated → Rebalanced: 51.81% STEM
- Categories: math, code, science, chat, safety
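
A minimal loading sketch for these subsets, assuming they are published on the Hugging Face Hub under the repository names above (replace `<namespace>` with the actual account hosting them; scale-specific configurations, if any, are not documented in this card):

```python
from datasets import load_dataset

# Hypothetical repository ID -- substitute the real namespace hosting the subsets.
reasoning = load_dataset(
    "<namespace>/stratified-kmeans-diverse-reasoning-100K-1M",
    split="train",
)
print(len(reasoning), reasoning[0])
```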
- Embedding Generation: Text embedded using Snowflake Arctic-embed-xs
- K-Means Clustering: For M required samples, apply k-means with k=M clusters (100 iterations)
- Centroid Selection: Select the sample closest to each cluster centroid as that cluster's representative (see the sketch after this list)
- Square-Root Balancing (for imbalanced datasets):
- Convert category counts to ratios
- Apply sqrt transformation: sqrt_ratio = sqrt(original_ratio)
- Renormalize: balanced_ratio = sqrt_ratio / sum(sqrt_ratios)
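
A minimal sketch of the selection pipeline described above, assuming sentence-transformers for the Snowflake Arctic-embed-xs embeddings and scikit-learn for k-means; the project's exact tooling is not stated, and for M in the 100K-1M range a scalable variant such as MiniBatchKMeans or a GPU k-means would be more practical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_diverse_subset(texts, m, seed=42):
    """Pick m diverse representatives from `texts` via k-means on embeddings."""
    # 1. Embedding generation with Snowflake Arctic-embed-xs
    model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")
    embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

    # 2. Deterministic k-means with k = m clusters and 100 iterations
    kmeans = KMeans(n_clusters=m, max_iter=100, n_init=1, random_state=seed)
    kmeans.fit(embeddings)

    # 3. Centroid selection: take the sample closest to each cluster centroid
    #    as that cluster's representative.
    nearest = pairwise_distances_argmin(kmeans.cluster_centers_, embeddings)
    return [texts[i] for i in np.unique(nearest)]
```

A fixed random_state with a single initialization keeps the procedure reproducible, consistent with the deterministic k-means setup described above.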
The original Llama-Nemotron Post-Training Dataset distribution was heavily skewed:
- Math: 66.96% → rebalanced to 52.03% (−22%)
- Code: 30.67% → rebalanced to 34.96% (+14%)
- Science: 2.15% → rebalanced to 9.26% (+330%)
- Chat: 0.12% → rebalanced to 2.15% (+1682%)
- Safety: 0.10% → rebalanced to 1.60% (+1580%)
Square-root transformation reduces math dominance while significantly increasing representation of underrepresented categories.
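
For reference, a worked re-computation of the square-root rebalancing from the rounded percentages listed above; the published figures were presumably derived from unrounded counts and per-category availability, so these outputs match them only approximately:

```python
import math

# Original category ratios as reported above (rounded to two decimals).
original = {"math": 0.6696, "code": 0.3067, "science": 0.0215,
            "chat": 0.0012, "safety": 0.0010}

# Apply the square-root transformation, then renormalize to sum to 1.
sqrt_ratios = {k: math.sqrt(v) for k, v in original.items()}
total = sum(sqrt_ratios.values())
balanced = {k: v / total for k, v in sqrt_ratios.items()}

for k in original:
    print(f"{k:8s} {original[k]:7.2%} -> {balanced[k]:7.2%}")
# math      66.96% ->  51.63%
# code      30.67% ->  34.94%
# science    2.15% ->   9.25%
# chat       0.12% ->   2.19%
# safety     0.10% ->   2.00%
```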
Each subset inherits the license from its source datasets. Please refer to individual dataset cards for complete licensing terms.
- FineWeb-Edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
- Proof-Pile-2: https://huggingface.co/datasets/EleutherAI/proof-pile-2
- Tulu-3 SFT Mixture: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
- Orca AgentInstruct: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
- Llama-Nemotron Post-Training Dataset: https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset
Project Website: amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale
