Show HN: Create-LLM – Train your own LLM in 60 seconds

npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py

That's it! You're training an LLM.


Training a language model from scratch is complex. You need:

  • ✅ Model architecture (GPT, BERT, T5...)
  • ✅ Data preprocessing pipeline
  • ✅ Tokenizer training
  • ✅ Training loop with callbacks
  • ✅ Checkpoint management
  • ✅ Evaluation metrics
  • ✅ Text generation
  • ✅ Deployment tools

create-llm gives you all of this in one command.


Choose from 4 templates optimized for different use cases:

  • NANO (1M params) - Learn in 2 minutes on any laptop
  • TINY (6M params) - Prototype in 15 minutes on CPU
  • SMALL (100M params) - Production models in hours
  • BASE (1B params) - Research-grade in days

Everything you need out of the box:

  • PyTorch training infrastructure
  • Data preprocessing pipeline
  • Tokenizer training (BPE, WordPiece, Unigram)
  • Checkpoint management with auto-save
  • TensorBoard integration
  • Live training dashboard
  • Interactive chat interface
  • Model comparison tools
  • Deployment scripts

Intelligent configuration that:

  • Auto-detects vocab size from tokenizer (see the sketch after this list)
  • Warns about model/data size mismatches
  • Detects overfitting during training
  • Suggests optimal hyperparameters
  • Handles cross-platform paths
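
As an illustration of the vocab-size check: the idea is to read the trained tokenizer file and prefer its real vocabulary size over whatever the config says. A minimal sketch of that idea (not the tool's actual code; the file path follows the generated project layout shown later):

import json

def detect_vocab_size(tokenizer_path="tokenizer/tokenizer.json", configured=10000):
    """Return the real vocab size from a trained tokenizer.json, warning on mismatch."""
    with open(tokenizer_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]
    actual = len(vocab)
    if actual != configured:
        print(f"Vocab size mismatch: config says {configured}, tokenizer has {actual}. Using {actual}.")
    return actual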

Optional integrations:

  • WandB - Experiment tracking
  • HuggingFace - Model sharing
  • SynthexAI - Synthetic data generation

# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm

# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm
npx @theanikrtgiri/create-llm

You'll be prompted for:

  • 📝 Project name
  • 🎯 Template (NANO, TINY, SMALL, BASE)
  • 🔤 Tokenizer type (BPE, WordPiece, Unigram)
  • 🔌 Optional plugins (WandB, HuggingFace, SynthexAI)
# Specify everything upfront
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe --skip-install

NANO - Perfect for learning and quick experiments

  • Parameters: ~1M
  • Hardware: Any CPU (2GB RAM)
  • Time: 1-2 minutes
  • Data: 100+ examples
  • Use: Learning, testing, demos

When to use:

  • First time training an LLM
  • Quick experiments and testing
  • Educational purposes
  • Understanding the pipeline
  • Limited data (100-1000 examples)

TINY - Perfect for prototyping and small projects

  • Parameters: ~6M
  • Hardware: CPU or basic GPU (4GB RAM)
  • Time: 5-15 minutes
  • Data: 1,000+ examples
  • Use: Prototypes, small projects

When to use:

  • Small-scale projects
  • Limited data (1K-10K examples)
  • Prototyping before scaling
  • Personal experiments
  • CPU-only environments

SMALL - Perfect for production applications

  • Parameters: ~100M
  • Hardware: RTX 3060+ (12GB VRAM)
  • Time: 1-3 hours
  • Data: 10,000+ examples
  • Use: Production, real apps

When to use:

  • Production applications
  • Domain-specific models
  • Real-world deployments
  • Good data availability
  • GPU available

BASE - Perfect for research and high-quality models

  • Parameters: ~1B
  • Hardware: A100 or multi-GPU
  • Time: 1-3 days
  • Data: 100,000+ examples
  • Use: Research, high-quality models

When to use:

  • Research projects
  • High-quality requirements
  • Large datasets available
  • Multi-GPU setup
  • Competitive performance needed

npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm
pip install -r requirements.txt

Place your text files in data/raw/:

# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt

# Or add your own files
cp /path/to/your/data.txt data/raw/

💡 Pro Tip: Start with at least 1MB of text for meaningful results

python tokenizer/train.py --data data/raw/

🔤 This creates a vocabulary from your data
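
Under the hood, tokenizer training of this kind is typically done with the Hugging Face tokenizers library; a rough sketch of what a BPE setup with the default config values implies (the generated tokenizer/train.py may differ in details):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE vocabulary on the raw text and save it where the project expects it
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)
tokenizer.train(files=["data/raw/shakespeare.txt"], trainer=trainer)
tokenizer.save("tokenizer/tokenizer.json")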

📊 This tokenizes and prepares your data for training
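
In the generated layout this step lives in data/prepare.py, so it is presumably run as python data/prepare.py (exact flags may differ; check the project's own README). Conceptually, preparation slides a fixed-length window over the tokenized text using the max_length and stride values from llm.config.js; a minimal sketch of that idea:

def make_windows(token_ids, max_length=512, stride=256):
    """Slice one long token stream into overlapping fixed-length training examples."""
    windows = []
    for start in range(0, max(len(token_ids) - max_length, 1), stride):
        chunk = token_ids[start:start + max_length]
        if len(chunk) == max_length:
            windows.append(chunk)
    return windows

# e.g. ~1M tokens with stride 256 gives roughly 3,900 overlapping 512-token examples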

# Basic training
python training/train.py

# With live dashboard (recommended!)
python training/train.py --dashboard
# Then open http://localhost:5000

# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt

📈 Watch your model learn in real-time!

python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
python evaluation/generate.py \
  --checkpoint checkpoints/checkpoint-best.pt \
  --prompt "Once upon a time" \
  --temperature 0.8

✨ See your model's creativity in action!
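
The --temperature flag controls sampling sharpness: logits are divided by the temperature before the softmax, so values below 1.0 make generations more conservative and values above 1.0 more random. A generic sketch of temperature sampling (not necessarily the generator's exact code):

import torch

def sample_next_token(logits, temperature=0.8):
    """Sample one token id from a [vocab_size] tensor of logits."""
    probs = torch.softmax(logits / max(temperature, 1e-6), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()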

python chat.py --checkpoint checkpoints/checkpoint-best.pt

💬 Chat with your trained model!

# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model

# To Replicate
python deploy.py --to replicate --model-name my-model

🚀 Share your model with the world!


my-llm/
├── 📁 data/
│   ├── raw/                  # Your training data goes here
│   ├── processed/            # Tokenized data (auto-generated)
│   ├── dataset.py            # PyTorch dataset classes
│   └── prepare.py            # Data preprocessing script
│
├── 📁 models/
│   ├── architectures/        # Model implementations
│   │   ├── gpt.py            # GPT architecture
│   │   ├── nano.py           # 1M parameter model
│   │   ├── tiny.py           # 6M parameter model
│   │   ├── small.py          # 100M parameter model
│   │   └── base.py           # 1B parameter model
│   ├── __init__.py
│   └── config.py             # Configuration loader
│
├── 📁 tokenizer/
│   ├── train.py              # Tokenizer training script
│   └── tokenizer.json        # Trained tokenizer (auto-generated)
│
├── 📁 training/
│   ├── train.py              # Main training script
│   ├── trainer.py            # Trainer class
│   ├── callbacks/            # Training callbacks
│   │   ├── base.py
│   │   ├── checkpoint.py
│   │   ├── logging.py
│   │   └── checkpoint_manager.py
│   └── dashboard/            # Live training dashboard
│       ├── dashboard_server.py
│       └── templates/
│
├── 📁 evaluation/
│   ├── evaluate.py           # Model evaluation
│   └── generate.py           # Text generation
│
├── 📁 plugins/               # Optional integrations
│   ├── wandb_plugin.py
│   ├── huggingface_plugin.py
│   └── synthex_plugin.py
│
├── 📁 checkpoints/           # Saved models (auto-generated)
├── 📁 logs/                  # Training logs (auto-generated)
│
├── 📄 llm.config.js          # Main configuration file
├── 📄 requirements.txt       # Python dependencies
├── 📄 chat.py                # Interactive chat interface
├── 📄 deploy.py              # Deployment script
├── 📄 compare.py             # Model comparison tool
└── 📄 README.md              # Project documentation

Everything is controlled via llm.config.js:

module.exports = {
  // Model architecture
  model: {
    type: 'gpt',
    size: 'tiny',
    vocab_size: 10000,    // Auto-detected from tokenizer
    max_length: 512,
    layers: 4,
    heads: 4,
    dim: 256,
    dropout: 0.2,
  },

  // Training settings
  training: {
    batch_size: 16,
    learning_rate: 0.0006,
    warmup_steps: 500,
    max_steps: 10000,
    eval_interval: 500,
    save_interval: 2000,
    optimizer: 'adamw',
    weight_decay: 0.01,
    gradient_clip: 1.0,
    mixed_precision: false,
    gradient_accumulation_steps: 1,
  },

  // Data settings
  data: {
    max_length: 512,
    stride: 256,
    val_split: 0.1,
    shuffle: true,
  },

  // Tokenizer settings
  tokenizer: {
    type: 'bpe',
    vocab_size: 10000,
    min_frequency: 2,
    special_tokens: ["<pad>", "<unk>", "<s>", "</s>"],
  },

  // Plugins
  plugins: [
    // 'wandb',
    // 'huggingface',
    // 'synthex',
  ],
};

npx @theanikrtgiri/create-llm [project-name] [options]
Option               Description                                         Default
--template <name>    Template to use (nano, tiny, small, base, custom)   Interactive
--tokenizer <type>   Tokenizer type (bpe, wordpiece, unigram)            Interactive
--skip-install       Skip npm/pip installation                           false
-y, --yes            Skip all prompts, use defaults                      false
-h, --help           Show help                                           -
-v, --version        Show version                                        -
# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm

# Quick start with defaults
npx @theanikrtgiri/create-llm my-project

# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install

# Skip prompts
npx @theanikrtgiri/create-llm my-project -y

Monitor training in real-time with a web interface:

python training/train.py --dashboard

Then open http://localhost:5000 to see:

  • Real-time loss curves
  • Learning rate schedule
  • Tokens per second
  • GPU memory usage
  • Recent checkpoints

Compare multiple trained models:

python compare.py checkpoints/model-v1/ checkpoints/model-v2/

Shows:

  • Side-by-side metrics
  • Sample generations
  • Performance comparison
  • Recommendation

Automatic checkpoint management:

  • Saves best model based on validation loss
  • Keeps last N checkpoints (configurable)
  • Auto-saves on Ctrl+C
  • Resume from any checkpoint
# Resume training
python training/train.py --resume checkpoints/checkpoint-5000.pt

# Evaluate specific checkpoint
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
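
Under the hood a checkpoint is essentially a torch.save of model and optimizer state plus metadata, and "best" just means lowest validation loss seen so far. A simplified sketch of that idea (the generated checkpoint manager presumably adds the rotation and Ctrl+C handling listed above):

import shutil
import torch

best_val_loss = float("inf")

def save_checkpoint(model, optimizer, step, val_loss,
                    path="checkpoints/checkpoint-latest.pt"):
    """Write a training checkpoint and keep a copy of the best one so far."""
    global best_val_loss
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "val_loss": val_loss,
    }, path)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        shutil.copyfile(path, "checkpoints/checkpoint-best.pt")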

Create your own plugins:

# plugins/my_plugin.py
from plugins.base import BasePlugin

class MyPlugin(BasePlugin):
    def on_train_start(self, trainer):
        print("Training started!")

    def on_step_end(self, trainer, step, loss):
        # Log to your service
        pass
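
To activate a custom plugin, it presumably needs to be listed in the plugins array of llm.config.js, the same way the built-in ones are enabled; plugins/base.py (the module the example imports from) should define the full set of available hooks.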

Minimum Data Requirements:

  • NANO: 100+ examples (good for learning)
  • TINY: 1,000+ examples (minimum for decent results)
  • SMALL: 10,000+ examples (recommended)
  • BASE: 100,000+ examples (for quality)

Data Quality:

  • Use clean, well-formatted text
  • Remove HTML, markdown, or special formatting
  • Ensure consistent encoding (UTF-8)
  • Remove duplicates
  • Balance different content types

Avoid Overfitting:

  • Watch for perplexity < 1.5 (a warning sign; see the check after this list)
  • Use validation split (10% recommended)
  • Increase dropout if overfitting
  • Add more data if possible
  • Use smaller model for small datasets
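
Perplexity here is just the exponential of the average validation cross-entropy loss, so the threshold above is easy to monitor yourself; a minimal check:

import math

def perplexity(mean_val_loss):
    """Perplexity = exp(mean cross-entropy loss in nats)."""
    return math.exp(mean_val_loss)

if perplexity(0.35) < 1.5:   # exp(0.35) ≈ 1.42, suspiciously low for a validation set
    print("Warning: validation perplexity below 1.5 - likely overfitting/memorization.")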

Optimize Training:

  • Start with NANO to test pipeline
  • Use mixed precision on GPU (mixed_precision: true)
  • Increase gradient_accumulation_steps if you run out of GPU memory (see the sketch after this list)
  • Monitor GPU usage with dashboard
  • Save checkpoints frequently
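
Gradient accumulation trades speed for memory: gradients from several small batches are summed before a single optimizer update, so the effective batch size is batch_size * gradient_accumulation_steps. A generic PyTorch sketch of the pattern (not the project's trainer; model, optimizer, and loader are assumed to exist):

import torch

accum_steps = 4                          # effective batch = batch_size * 4
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    loss = model(inputs, targets)        # assumes the model returns the loss
    (loss / accum_steps).backward()      # scale so the accumulated gradient is an average
    if (i + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient_clip from config
        optimizer.step()
        optimizer.zero_grad()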

Hyperparameter Tuning:

  • Learning rate: Start with 3e-4, adjust if unstable
  • Batch size: As large as GPU allows
  • Warmup steps: 10% of total steps (see the schedule sketch after this list)
  • Dropout: 0.1-0.3 depending on data size
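
Warmup ramps the learning rate linearly up from zero before any decay begins; with max_steps: 10000 the 10% rule above means roughly warmup_steps: 1000. A generic linear warmup-then-decay schedule in PyTorch (the project's scheduler may differ):

from torch.optim.lr_scheduler import LambdaLR

def warmup_then_linear_decay(optimizer, warmup_steps=1000, max_steps=10000):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(warmup_steps, 1)              # 0 -> 1 during warmup
        progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
        return max(0.0, 1.0 - progress)                     # 1 -> 0 afterwards
    return LambdaLR(optimizer, lr_lambda)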

Before Deploying:

  • Evaluate on held-out test set
  • Test generation quality
  • Check model size
  • Verify inference speed
  • Test on target hardware

Deployment Options:

  • Hugging Face Hub (easiest)
  • Replicate (API endpoint)
  • Docker container (custom)
  • Cloud platforms (AWS, GCP, Azure)

"Vocab size mismatch detected"

  • ✅ This is normal! The tool auto-detects and fixes it
  • The model will use the actual tokenizer vocab size

"Model may be too large for dataset"

  • ⚠️ Warning: Risk of overfitting
  • Solutions: Add more data, use smaller template, increase dropout

"Perplexity < 1.1 indicates severe overfitting"

  • ❌ Model memorized the data
  • Solutions: Add much more data, use smaller model, increase regularization

"CUDA out of memory"

  • Reduce batch_size in llm.config.js
  • Enable mixed_precision: true
  • Increase gradient_accumulation_steps
  • Use smaller model template

"Tokenizer not found"

  • Run python tokenizer/train.py --data data/raw/ first
  • Make sure data/raw/ contains .txt files

"Training loss not decreasing"

  • Check learning rate (try 1e-4 to 1e-3)
  • Verify data is loading correctly
  • Check for data preprocessing issues
  • Try longer warmup period

  • Node.js 18.0.0 or higher
  • npm 8.0.0 or higher
  • Python 3.8 or higher
  • PyTorch 2.0.0 or higher
  • 4GB RAM minimum (NANO/TINY)
  • 12GB VRAM recommended (SMALL)
  • 40GB+ VRAM for BASE
  • ✅ Windows 10/11
  • ✅ macOS 10.15+
  • ✅ Linux (Ubuntu 20.04+)

Setup Development Environment

git clone https://github.com/theaniketgiri/create-llm.git
cd create-llm
npm install
node dist/index.js test-project --template nano
npm version patch   # or minor/major
npm publish

Area              Description                          Difficulty
🐛 Bug Fixes      Fix issues and improve stability     🟢 Easy
📝 Documentation  Improve guides and examples          🟢 Easy
🎨 New Templates  Add BERT, T5, custom architectures   🟡 Medium
🔌 Plugins        Integrate new services               🟡 Medium
🧪 Testing        Increase test coverage               🟡 Medium
🌍 i18n           Internationalization support         🔴 Hard

Thanks to all contributors who have helped make this project better!



  • More model architectures (BERT, T5)
  • Distributed training support
  • Model quantization tools
  • Fine-tuning templates
  • Web UI for project management
  • Automatic hyperparameter tuning
  • Model compression tools
  • More deployment targets
  • Multi-modal support
  • Reinforcement learning from human feedback
  • Advanced optimization techniques
  • Cloud training integration

MIT © Aniket Giri

See LICENSE for more information.


Built with amazing open-source tools.

Special thanks to the LLM community for inspiration and feedback.


If you find this project useful, please consider giving it a star!



Made with ❤️ for the LLM community


If create-llm helped you, consider:

  • ⭐ Starring the repo
  • 🐛 Reporting bugs
  • 💡 Suggesting features
  • 📝 Improving docs
  • 🔀 Contributing code

Together, let's make LLM training accessible to everyone!
