That's it! You're training an LLM. ✨
Training a language model from scratch is complex. You need:
- ✅ Model architecture (GPT, BERT, T5...)
- ✅ Data preprocessing pipeline
- ✅ Tokenizer training
- ✅ Training loop with callbacks
- ✅ Checkpoint management
- ✅ Evaluation metrics
- ✅ Text generation
- ✅ Deployment tools
create-llm gives you all of this in one command.
Choose from 4 templates optimized for different use cases:
- NANO (1M params) - Learn in 2 minutes on any laptop
- TINY (6M params) - Prototype in 15 minutes on CPU
- SMALL (100M params) - Production models in hours
- BASE (1B params) - Research-grade in days
Everything you need out of the box:
- PyTorch training infrastructure
- Data preprocessing pipeline
- Tokenizer training (BPE, WordPiece, Unigram)
- Checkpoint management with auto-save
- TensorBoard integration
- Live training dashboard
- Interactive chat interface
- Model comparison tools
- Deployment scripts
Intelligent configuration that:
- Auto-detects vocab size from tokenizer
- Warns about model/data size mismatches
- Detects overfitting during training
- Suggests optimal hyperparameters
- Handles cross-platform paths
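The first two checks can be approximated in a few lines. This is only an illustrative sketch using the Hugging Face tokenizers library; the file path and the configured value are placeholders, and the generated project performs the real check for you:

```python
# Illustrative only -- the actual file layout and config handling in a
# generated project may differ.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer/tokenizer.json")  # assumed path
actual_vocab = tokenizer.get_vocab_size()

configured_vocab = 32000  # e.g. the value written in llm.config.js
if configured_vocab != actual_vocab:
    print(f"Vocab size mismatch: config says {configured_vocab}, "
          f"tokenizer has {actual_vocab}; using {actual_vocab}.")
```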
Optional integrations:
- WandB - Experiment tracking
- HuggingFace - Model sharing
- SynthexAI - Synthetic data generation
You'll be prompted for:
- 📝 Project name
- 🎯 Template (NANO, TINY, SMALL, BASE)
- 🔤 Tokenizer type (BPE, WordPiece, Unigram)
- 🔌 Optional plugins (WandB, HuggingFace, SynthexAI)
NANO - Perfect for learning and quick experiments
When to use:
- First time training an LLM
- Quick experiments and testing
- Educational purposes
- Understanding the pipeline
- Limited data (100-1000 examples)
TINY - Perfect for prototyping and small projects
When to use:
- Small-scale projects
- Limited data (1K-10K examples)
- Prototyping before scaling
- Personal experiments
- CPU-only environments
SMALL - Perfect for production applications
When to use:
- Production applications
- Domain-specific models
- Real-world deployments
- Good data availability
- GPU available
BASE - Perfect for research and high-quality models
When to use:
- Research projects
- High-quality requirements
- Large datasets available
- Multi-GPU setup
- Competitive performance needed
Place your text files in data/raw/.
💡 Pro Tip: Start with at least 1MB of text for meaningful results
🔤 Train the tokenizer - this creates a vocabulary from your data (see the sketch after these steps)
📊 Prepare the dataset - this tokenizes your data and gets it ready for training
📈 Train the model - watch it learn in real time!
✨ Generate text - see your model's creativity in action!
💬 Chat with your trained model!
🚀 Deploy - share your model with the world!
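For reference, here is roughly what the tokenizer-training step does, sketched with the Hugging Face tokenizers library. The vocab size, special tokens, and file paths are illustrative assumptions; the generated tokenizer/train.py handles this for you:

```python
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

files = [str(p) for p in Path("data/raw").glob("*.txt")]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # illustrative; match your template/config
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files, trainer)
tokenizer.save("tokenizer/tokenizer.json")  # assumed output path
```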
Everything is controlled via llm.config.js. The CLI itself accepts the following options:
| Option | Description | Default |
|--------|-------------|---------|
| --template <name> | Template to use (nano, tiny, small, base, custom) | Interactive |
| --tokenizer <type> | Tokenizer type (bpe, wordpiece, unigram) | Interactive |
| --skip-install | Skip npm/pip installation | false |
| -y, --yes | Skip all prompts, use defaults | false |
| -h, --help | Show help | - |
| -v, --version | Show version | - |
Monitor training in real time with the built-in web dashboard, then open http://localhost:5000 to see:
- Real-time loss curves
- Learning rate schedule
- Tokens per second
- GPU memory usage
- Recent checkpoints
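The same numbers are also available through the TensorBoard integration. As a rough sketch of the kind of logging involved (the generated trainer wires this up for you; the log directory and tag names are assumptions):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs")  # assumed log directory

def log_step(step, loss, lr, tokens, elapsed_s):
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/lr", lr, step)
    writer.add_scalar("train/tokens_per_sec", tokens / elapsed_s, step)
    if torch.cuda.is_available():
        writer.add_scalar("train/gpu_mem_gb",
                          torch.cuda.memory_allocated() / 1e9, step)
```

Point TensorBoard at the log directory (tensorboard --logdir logs) to view the curves alongside the dashboard.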
Compare multiple trained models. The comparison shows:
- Side-by-side metrics
- Sample generations
- Performance comparison
- Recommendation
Automatic checkpoint management:
- Saves best model based on validation loss
- Keeps last N checkpoints (configurable)
- Auto-saves on Ctrl+C
- Resume from any checkpoint
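Conceptually, best-model checkpointing boils down to something like the sketch below. File paths are placeholders; the generated checkpoint manager adds rotation of the last N checkpoints and the Ctrl+C handler on top of this logic:

```python
import torch

best_val_loss = float("inf")

def save_checkpoint(model, optimizer, step, val_loss, path):
    torch.save({
        "step": step,
        "val_loss": val_loss,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def maybe_save_best(model, optimizer, step, val_loss):
    global best_val_loss
    # Regular checkpoint for resuming
    save_checkpoint(model, optimizer, step, val_loss, f"checkpoints/step_{step}.pt")
    # Separate "best" checkpoint tracked by validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model, optimizer, step, val_loss, "checkpoints/best.pt")
```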
Create your own plugins:
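The exact plugin interface is defined by the generated project, so treat the following purely as a hypothetical shape: a plugin as an object that exposes training callbacks (the hook names here are invented for illustration):

```python
# Hypothetical plugin shape -- the real interface in a generated project
# may use different hook names and registration.
class LossLoggerPlugin:
    """Example: append every training loss to a local file."""

    def on_train_start(self, config):
        self.log = open("loss_log.txt", "w")

    def on_step_end(self, step, metrics):
        self.log.write(f"{step}\t{metrics['loss']:.4f}\n")

    def on_train_end(self):
        self.log.close()
```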
Minimum Data Requirements:
- NANO: 100+ examples (good for learning)
- TINY: 1,000+ examples (minimum for decent results)
- SMALL: 10,000+ examples (recommended)
- BASE: 100,000+ examples (for quality)
Data Quality:
- Use clean, well-formatted text
- Remove HTML, markdown, or special formatting
- Ensure consistent encoding (UTF-8)
- Remove duplicates
- Balance different content types
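A minimal cleaning pass along these lines might look like the following sketch (the incoming folder and output file name are assumptions; adapt it to your corpus):

```python
import re
from pathlib import Path

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    text = re.sub(r"[ \t]+", " ", text)    # collapse whitespace
    return text.strip()

seen, lines_out = set(), []
for path in Path("incoming").glob("*.txt"):                     # assumed source folder
    raw = path.read_text(encoding="utf-8", errors="replace")    # force UTF-8
    for line in clean(raw).splitlines():
        if line and line not in seen:                           # drop exact duplicates
            seen.add(line)
            lines_out.append(line)

Path("data/raw/clean.txt").write_text("\n".join(lines_out), encoding="utf-8")
```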
Avoid Overfitting:
- Watch for perplexity < 1.5 (warning sign)
- Use validation split (10% recommended)
- Increase dropout if overfitting
- Add more data if possible
- Use smaller model for small datasets
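Since perplexity is just the exponential of the cross-entropy loss, the overfitting check and the 90/10 split are easy to reason about. A self-contained sketch with placeholder values:

```python
import math
import random

def perplexity(cross_entropy_loss: float) -> float:
    # Perplexity = exp(cross-entropy); values near 1.0 mean near-memorization.
    return math.exp(cross_entropy_loss)

val_loss = 0.25                     # example value from your eval loop
if perplexity(val_loss) < 1.5:
    print("Warning: perplexity below 1.5 -- likely overfitting.")

# 90/10 train/validation split over a list of text examples
examples = [f"example {i}" for i in range(1000)]   # placeholder data
random.shuffle(examples)
split = int(0.9 * len(examples))
train_set, val_set = examples[:split], examples[split:]
```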
Optimize Training:
- Start with NANO to test pipeline
- Use mixed precision on GPU (mixed_precision: true)
- Increase gradient_accumulation_steps if OOM
- Monitor GPU usage with dashboard
- Save checkpoints frequently
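For intuition, here is what mixed_precision and gradient_accumulation_steps correspond to inside a PyTorch training loop. This is a toy, CUDA-only sketch with a dummy model, not the project's actual trainer:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup just to make the loop runnable
model = nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
loader = DataLoader(data, batch_size=8)

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # plays the role of gradient_accumulation_steps

for i, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():                      # mixed precision forward pass
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
    scaler.scale(loss).backward()                        # accumulate scaled gradients
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)                           # one real optimizer step
        scaler.update()
        optimizer.zero_grad()
```

Raising accum_steps keeps the effective batch size while lowering per-step memory, which is why it is the first knob to turn on OOM errors.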
Hyperparameter Tuning:
- Learning rate: Start with 3e-4, adjust if unstable
- Batch size: As large as GPU allows
- Warmup steps: 10% of total steps
- Dropout: 0.1-0.3 depending on data size
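As a concrete example of the warmup rule, a linear warmup to 3e-4 over the first 10% of steps followed by a linear decay can be expressed with a plain PyTorch scheduler (the generated trainer may use a different decay shape):

```python
import torch
from torch import nn

total_steps = 10_000
warmup_steps = int(0.1 * total_steps)   # 10% of total steps

model = nn.Linear(8, 8)                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # suggested starting LR

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                # linear warmup to 3e-4
    remaining = total_steps - warmup_steps
    return max(0.0, (total_steps - step) / remaining)     # linear decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.
```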
Before Deploying:
- Evaluate on held-out test set
- Test generation quality
- Check model size
- Verify inference speed
- Test on target hardware
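A quick way to check inference speed on the target hardware is to time generation directly; generate below is a stand-in for whatever generation entry point your project exposes:

```python
import time

def benchmark(generate, prompt: str, max_new_tokens: int = 128) -> float:
    start = time.perf_counter()
    generate(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return max_new_tokens / elapsed   # approximate tokens per second

# Example with a dummy generator so the snippet runs as-is:
tps = benchmark(lambda p, max_new_tokens: "x" * max_new_tokens, "Hello")
print(f"{tps:.1f} tokens/sec")
```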
Deployment Options:
- Hugging Face Hub (easiest)
- Replicate (API endpoint)
- Docker container (custom)
- Cloud platforms (AWS, GCP, Azure)
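For the Hugging Face Hub route, an upload can be as simple as the sketch below using the huggingface_hub library. The repo name and export folder are placeholders, and you need to be logged in via huggingface-cli login or an HF_TOKEN environment variable:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/my-llm", exist_ok=True)   # placeholder repo id
api.upload_folder(
    folder_path="checkpoints/best",                      # assumed export directory
    repo_id="your-username/my-llm",
)
```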
"Vocab size mismatch detected"
- ✅ This is normal! The tool auto-detects and fixes it
- The model will use the actual tokenizer vocab size
"Model may be too large for dataset"
- ⚠️ Warning: Risk of overfitting
- Solutions: Add more data, use smaller template, increase dropout
"Perplexity < 1.1 indicates severe overfitting"
- ❌ Model memorized the data
- Solutions: Add much more data, use smaller model, increase regularization
"CUDA out of memory"
- Reduce batch_size in llm.config.js
- Enable mixed_precision: true
- Increase gradient_accumulation_steps
- Use smaller model template
"Tokenizer not found"
- Run python tokenizer/train.py --data data/raw/ first
- Make sure data/raw/ contains .txt files
"Training loss not decreasing"
- Check learning rate (try 1e-4 to 1e-3)
- Verify data is loading correctly
- Check for data preprocessing issues
- Try longer warmup period
- Node.js 18.0.0 or higher
- npm 8.0.0 or higher
- Python 3.8 or higher
- PyTorch 2.0.0 or higher
- 4GB RAM minimum (NANO/TINY)
- 12GB VRAM recommended (SMALL)
- 40GB+ VRAM for BASE
- ✅ Windows 10/11
- ✅ macOS 10.15+
- ✅ Linux (Ubuntu 20.04+)
| Area | Description | Difficulty |
|------|-------------|------------|
| 🐛 Bug Fixes | Fix issues and improve stability | 🟢 Easy |
| 📝 Documentation | Improve guides and examples | 🟢 Easy |
| 🎨 New Templates | Add BERT, T5, custom architectures | 🟡 Medium |
| 🔌 Plugins | Integrate new services | 🟡 Medium |
| 🧪 Testing | Increase test coverage | 🟡 Medium |
| 🌍 i18n | Internationalization support | 🔴 Hard |
- More model architectures (BERT, T5)
- Distributed training support
- Model quantization tools
- Fine-tuning templates
- Web UI for project management
- Automatic hyperparameter tuning
- Model compression tools
- More deployment targets
- Multi-modal support
- Reinforcement learning from human feedback
- Advanced optimization techniques
- Cloud training integration
MIT © Aniket Giri
See LICENSE for more information.
Built with amazing open-source tools:
- PyTorch - Deep learning framework
- Transformers - Model implementations
- Tokenizers - Fast tokenization
- Commander.js - CLI framework
- Inquirer.js - Interactive prompts
Special thanks to the LLM community for inspiration and feedback.
Star History: https://star-history.com/#theaniketgiri/create-llm&Date