GPT from scratch. Just NumPy and Python.
Understanding comes from building. This repo implements the core pieces of neural networks - modules, tokenizers, optimizers, backpropagation - using only NumPy. No autograd, no tensor abstractions. Every gradient computation is explicit.
core modules:
- Linear, Embedding, LayerNorm, Softmax, ReLU, MultiHeadAttention, FeedForward
- Adam (first-order optimizer)
- Tokenizers: char-level, word-level, BPE
- GPT model (transformer decoder)
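
As an illustration of the explicit-gradient style (no autograd), here is a minimal sketch of a linear layer with hand-written forward and backward passes. The class name mirrors the module list above, but the body is illustrative rather than the repo's exact implementation:

```python
import numpy as np

class Linear:
    """Minimal linear layer with explicit forward/backward (illustrative sketch)."""
    def __init__(self, in_features, out_features):
        # Small random weights, zero bias
        self.W = np.random.randn(in_features, out_features) * 0.02
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients w.r.t. parameters, computed explicitly
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # Gradient w.r.t. the input, passed on to the previous layer
        return grad_out @ self.W.T
```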
 
educational resources:
- BACKPROP.md - what backpropagation is and how to implement it from scratch
- OPTIMIZERS.md - understand the difference between Adam and SGD (see the update-rule sketch after this list)
- TOKENIZERS.md - understand the difference between character-level, word-level, and BPE tokenization
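
For a quick taste of what OPTIMIZERS.md walks through, here is a minimal sketch of one Adam update step (function and argument names are illustrative, not the repo's exact API):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w given gradient g; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```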
 
features:
- Explicit gradients - see exactly how backprop works
- PyTorch-like API - familiar interface
- Complete transformer - multi-head attention, feedforward, layer norm (see the attention sketch after this list)
- Flexible tokenization - character-level, word-level, or BPE preprocessing
- Extensive testing - forward and backward correctness tests for every layer
- Minimal dependencies - just NumPy and the standard library
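
To give a flavor of the transformer pieces listed above, here is a minimal sketch of single-head causal scaled dot-product attention in plain NumPy (forward pass only; the repo's MultiHeadAttention presumably adds learned projections and multiple heads, and this function name is illustrative):

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal scaled dot-product attention (illustrative sketch).

    q, k, v: arrays of shape (T, d) for a sequence of length T.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (T, T) similarity matrix
    mask = np.triu(np.ones((T, T), dtype=bool), 1)      # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)            # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (T, d) weighted sum of values
```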
 
Perfect for understanding how modern language models actually work.
references & inspiration:
- PyTorch's repo - architecture and API inspiration
- building micrograd [YT] - backprop from scratch, explained simply
- micrograd - a tiny scalar-valued autograd engine
- CNN in NumPy for MNIST - a convolutional network implemented in NumPy
- LayerNorm implementation in llm.c (Karpathy's again <3) - layernorm forward/backward implementation with torch
- Kaggle's L-layer neural network using NumPy - cats/dogs classifier in NumPy
- Forward and Backpropagation in Neural Networks using Python - forward + backward pass walkthrough
 
Three ways to represent text, three different models, same Shakespeare. Let's see what happens.
I trained three identical transformer models on Shakespeare; the only difference is how the text is tokenized.
| Hyperparameter | Value |
|---|---|
| batch_size | 16 |
| block_size | 128 |
| max_iters | 8,000 |
| lr | 3e-4 |
| min_lr | 3e-5 |
| n_layer | 4 |
| n_head | 4 |
| n_embd | 256 |
| warmup_iters | 800 |
| grad_clip | 1.0 |
- Char-level: one character = one token.
- Word-level: one word = one token. Split on spaces and punctuation, lowercase everything to limit OOV (i.e., UNK).
- BPE: learns frequent character pairs and builds subwords bottom-up (see the merge-loop sketch after this list).
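
To make the bottom-up merging concrete, here is a minimal sketch of the core BPE training loop (a hypothetical helper, not the repo's tokenizer API):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (illustrative sketch)."""
    # Represent each word as a tuple of symbols, starting from single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair becomes a new subword
        merges.append(best)
        # Re-segment every word with the newly merged symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe(["low", "lower", "lowest"], 3)` would first merge `('l', 'o')`, then `('lo', 'w')`, gradually building subwords from characters.
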
| Metric | Char | Word | BPE |
|---|---|---|---|
| Final Loss | 1.5 ⭐ | 3.0 | 3.0 |
| Output Readability | ❌ (broken words) | ✅ | ✅ ⭐ |
| OOV Handling | ✅ | ❌ | ✅ |
| Semantic Coherence | ❌ | ✅ | ✅ |
| Character Names | ❌ | ✅ | ✅ |
| Natural Phrases | ❌ | ✅ | ✅ |
| Training Speed | Fast → Unstable | Steady | Slow but Stable |
| Number of chars (500 tokens) | 490 | 1602 ⭐ | 1505 |
| Number of parameters | 3.23M ⭐ | 6.55M | 6.55M |
| Embedding-related parameters | 68k (2.11%) ⭐ | 3.4M (52%) | 3.4M (52%) |
Each model comes with a 2×2 panel of plots to track training:
- Top Left: Training and validation loss over time
- Top Right: Gradient norm (watch for spikes = instability)
- Bottom Left: Learning rate schedule (warmup + cosine decay; a sketch follows this list)
- Bottom Right: Validation loss improvement per eval window
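
For reference, the bottom-left panel corresponds to a schedule like the following minimal sketch, plugging in the values from the hyperparameter table (the function name is illustrative and details may differ from the repo's scheduler):

```python
import math

def get_lr(it, lr=3e-4, min_lr=3e-5, warmup_iters=800, max_iters=8000):
    """Linear warmup followed by cosine decay down to min_lr (illustrative sketch)."""
    if it < warmup_iters:
        return lr * (it + 1) / warmup_iters              # linear warmup from ~0 up to lr
    if it >= max_iters:
        return min_lr                                     # floor once training ends
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))    # decays smoothly from 1 to 0
    return min_lr + coeff * (lr - min_lr)
```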
 
Asked each model to generate 500 tokens of Shakespeare:
- BPE - complete output: bpe.out
- Word - complete output: word.out (yes, I know I still have some tokenization issues...)
- Char - complete output: char.out
- Lower loss ≠ better output: the character model had the lowest loss but the worst readability; its loss is lower because it predicts 1 of 69 characters, which is much easier than predicting 1 of 6,551 words/subwords (see the back-of-the-envelope check after this list).
- Number of parameters: despite sharing the same architecture configuration, the BPE and word-level models have a much larger embedding matrix, which raises the parameter count from 3.23M to 6.55M (52% of it embedding-related).
- Token efficiency matters: the same 500 generated tokens yield very different text lengths: ~500 characters for char, ~1500 for BPE, and ~1600 for word.
- Stability matters: BPE's consistent training beats unstable fast learning.
- The curse of granularity: finer tokens (char) = easier prediction but harder composition; coarser tokens (word) = harder prediction but more natural composition.
- There's no free lunch: each approach trades off different aspects.
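
The loss gap is mostly a vocabulary-size effect: a uniform guess over V tokens has cross-entropy ln(V), so the setups start from very different baselines. A quick check using the vocabulary sizes mentioned above (69 characters vs 6,551 words/subwords):

```python
import math

# Cross-entropy (in nats) of a uniform guess over the vocabulary
char_baseline = math.log(69)     # ~4.23 for the 69-character vocab
word_baseline = math.log(6551)   # ~8.79 for the 6,551-token word/BPE vocab

print(f"char baseline: {char_baseline:.2f}, word/BPE baseline: {word_baseline:.2f}")
# A final loss of 1.5 vs 3.0 therefore reflects different starting points,
# not a model that understands Shakespeare twice as well.
```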
 
TODOs:
- BPE with byte-fallback
 