Diverse LLM subsets via k-means (100K-1M) [Pretraining, IF, Reasoning]


Stratified LLM Subsets delivers diverse training data at 100K-1M scale across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering selects diverse, representative samples, while square-root re-balancing prevents any single category from dominating across the 5 high-quality open datasets.


This project provides diverse, representative subsets from large-scale training corpora across multiple domains using embedding-based k-means clustering rather than random sampling:

  • Scales: 50k, 100k, 250k, 500k, and 1M samples
  • Methodology: Deterministic k-means clustering on embeddings (Snowflake Arctic-embed-xs) with 100 iterations
  • Balancing: Square-root transformation for imbalanced datasets to prevent category dominance
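
The embedding step can be sketched as follows, assuming the sentence-transformers library and the Hugging Face model id Snowflake/snowflake-arctic-embed-xs; the helper name and batch size are illustrative, not the authors' exact code:

from sentence_transformers import SentenceTransformer

# Assumed model id for Snowflake Arctic-embed-xs on the Hugging Face Hub.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

def embed_texts(texts, batch_size=256):
    # Returns a (len(texts), dim) array of L2-normalized text embeddings,
    # suitable as input to the k-means step described below.
    return model.encode(texts, batch_size=batch_size, normalize_embeddings=True)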

Pre-Training Dataset

stratified-kmeans-diverse-pretraining-100K-1M

Combines FineWeb-Edu (educational web content) and Proof-Pile-2 (mathematical/scientific documents):

  • FineWeb-Edu: 6 CommonCrawl snapshots from 2025 (99M rows filtered)
  • Proof-Pile-2: algebraic-stack, arxiv, open-web-math

Instruction-Following Dataset

stratified-kmeans-diverse-instruction-following-100K-1M

Combines Tulu-3 SFT Mixture and Orca AgentInstruct:

  • Tulu-3: State-of-the-art post-training recipe (939K samples)
  • Orca AgentInstruct: Agentic multi-step reasoning tasks (~1M samples)

Reasoning Dataset

stratified-kmeans-diverse-reasoning-100K-1M

Stratified subset of Llama-Nemotron Post-Training Dataset with square-root rebalancing:

  • Original: 80.52% STEM-dominated → Rebalanced: 51.81% STEM
  • Categories: math, code, science, chat, safety

Embedding-Based K-Means Clustering

  1. Embedding Generation: Text embedded using Snowflake Arctic-embed-xs
  2. K-Means Clustering: For M required samples, apply k-means with k=M clusters (100 iterations)
  3. Centroid Selection: Select cluster centroids as representative samples
  4. Square-Root Balancing (for imbalanced datasets):
    • Convert category counts to ratios
    • Apply sqrt transformation: sqrt_ratio = sqrt(original_ratio)
    • Renormalize: balanced_ratio = sqrt_ratio / sum(sqrt_ratios)
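
A minimal sketch of steps 2 and 3, assuming scikit-learn's MiniBatchKMeans (chosen here for scalability; the authors' actual clustering implementation is not specified) and pre-computed embeddings. Because centroids are synthetic vectors rather than real rows, "centroid selection" is approximated by picking the dataset point nearest each centroid:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def select_diverse_subset(embeddings, m, seed=0):
    # Cluster the corpus into m groups (100 iterations, fixed seed for determinism)
    # and keep one representative row per cluster: the point nearest its centroid.
    km = MiniBatchKMeans(n_clusters=m, max_iter=100, random_state=seed)
    labels = km.fit_predict(embeddings)
    selected = []
    for c in range(m):
        members = np.where(labels == c)[0]
        if members.size == 0:  # MiniBatchKMeans can leave a cluster empty
            continue
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.asarray(selected)

# Example: pick 100k representative row indices from a pre-embedded corpus.
# indices = select_diverse_subset(corpus_embeddings, m=100_000)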

Example: Llama-Nemotron Rebalancing

Original Llama-Nemotron Post-Training Dataset distribution was heavily skewed:

  • Math: 66.96% → rebalanced to 52.03% (−22%)
  • Code: 30.67% → rebalanced to 34.96% (+14%)
  • Science: 2.15% → rebalanced to 9.26% (+330%)
  • Chat: 0.12% → rebalanced to 2.15% (+1682%)
  • Safety: 0.10% → rebalanced to 1.60% (+1580%)

Square-root transformation reduces math dominance while significantly increasing representation of underrepresented categories.
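
As a sanity check, the step-4 transformation can be applied directly to the rounded shares above. The result lands close to, but not exactly on, the published figures, since the authors presumably worked from exact row counts rather than rounded percentages (a minimal sketch with illustrative names):

import numpy as np

def sqrt_rebalance(shares):
    # shares: category -> original ratio or percentage (the constant factor
    # cancels after renormalization); returns renormalized sqrt ratios.
    sqrt_ratios = {k: np.sqrt(v) for k, v in shares.items()}
    norm = sum(sqrt_ratios.values())
    return {k: v / norm for k, v in sqrt_ratios.items()}

original = {"math": 66.96, "code": 30.67, "science": 2.15, "chat": 0.12, "safety": 0.10}
rebalanced = sqrt_rebalance(original)
# Roughly {"math": 0.52, "code": 0.35, "science": 0.09, "chat": 0.02, "safety": 0.02}.
# For a 100k-sample subset, these ratios translate into per-category quotas:
quotas = {k: round(v * 100_000) for k, v in rebalanced.items()}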


from datasets import load_dataset

# Load pre-training data
pretraining = load_dataset(
    "AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M",
    split="100k"
)

# Load instruction-following data
instruction = load_dataset(
    "AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M",
    split="100k"
)

# Load reasoning data
reasoning = load_dataset(
    "AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M",
    split="100k"
)
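
The other scales are presumably exposed as additional named splits following the same pattern as "100k" above; this is an assumption, so check each dataset card for the exact split names:

# Assumed split name for the 1M-scale subset; verify against the dataset card.
pretraining_1m = load_dataset(
    "AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M",
    split="1M"
)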


@misc{priyanshu2025stratifiedllm,
  title={{Stratified LLM Subsets: Pre-Training, Instruction-Following, and Reasoning SFT Data at 100K-1M Scale}},
  author={Priyanshu, Aman and Vijay, Supriti},
  year={2025},
  howpublished={\url{https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/}},
  note={Available at \url{https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M}}
}

Each subset inherits the license from its source datasets. Please refer to individual dataset cards for complete licensing terms.



Project Website: amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale
