From Scratch GPT Built with NumPy (Tokenizer, Model, Adam)


GPT from scratch. Just NumPy and Python.


Understanding comes from building. This repo implements the core pieces of neural networks - modules, tokenizers, optimizers, backpropagation - using only NumPy. No autograd, no tensor abstractions. Every gradient computation is explicit.

```bash
pip install numpy

# Tokenize data (char-level, word-level, or subword with BPE)
./datagen.py

# Train a GPT model
./train.py

# Generate text from the trained model
./sample.py

# Plot training curves
./plot.py

# Test the implementation
./test.py
```

core modules:

  • Linear, Embedding, LayerNorm, Softmax, ReLU, MultiHeadAttention, FeedForward
  • Adam first-order optimizer (minimal update sketch after this list)
  • Tokenizers: char-level, word-level, BPE
  • GPT model (transformer decoder)
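
The Adam item above reduces to a few lines of NumPy. A minimal sketch of the update rule (class and method names are illustrative, not the repo's `optim` API):

```python
import numpy as np

class Adam:
    """Minimal Adam sketch; parameters are NumPy arrays updated in place."""
    def __init__(self, params, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params, self.lr = params, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]  # first moment (running mean of grads)
        self.v = [np.zeros_like(p) for p in params]  # second moment (running mean of squared grads)
        self.t = 0                                   # step counter for bias correction

    def step(self, grads):
        self.t += 1
        for p, g, m, v in zip(self.params, grads, self.m, self.v):
            m[:] = self.beta1 * m + (1 - self.beta1) * g
            v[:] = self.beta2 * v + (1 - self.beta2) * g * g
            m_hat = m / (1 - self.beta1 ** self.t)   # correct startup bias toward zero
            v_hat = v / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

The per-parameter second-moment rescaling is the key difference from plain SGD that OPTIMIZERS.md covers.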

educational resources:

  • BACKPROP.md - what it is and how to implement it from scratch
  • OPTIMIZERS.md - understand the difference between Adam and SGD
  • TOKENIZERS.md - understand the difference between character-level, word-level, and BPE tokenization
```python
# Every layer follows this pattern
class Linear:
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.02
        self.b = np.zeros(out_features)

    def forward(self, X):
        self.X = X                     # cache for backward
        return X @ self.W + self.b

    def backward(self, dY):
        # dY: gradient flowing back from the next layer
        self.dW = self.X.T @ dY        # gradient w.r.t. weights
        self.db = np.sum(dY, axis=0)   # gradient w.r.t. bias
        dX = dY @ self.W.T             # gradient w.r.t. input
        return dX
```
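
Layers written this way can be verified with a finite-difference gradient check, the standard way to test hand-written backward passes. A minimal sketch against the `Linear` layer above (illustrative; not the repo's actual test suite):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of the scalar function f w.r.t. array x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + eps; f_plus = f()
        x[i] = old - eps; f_minus = f()
        x[i] = old
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

np.random.seed(0)
layer = Linear(4, 3)
X = np.random.randn(2, 4)

# Scalar loss: sum of outputs, so the upstream gradient dY is all ones
loss = lambda: layer.forward(X).sum()
layer.forward(X)
layer.backward(np.ones((2, 3)))

assert np.allclose(layer.dW, numerical_grad(loss, layer.W), atol=1e-6)
assert np.allclose(layer.db, numerical_grad(loss, layer.b), atol=1e-6)
```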
```
numpyGPT/
├── nn/
│   ├── modules/       # Linear, Embedding, LayerNorm, etc.
│   └── functional.py  # cross_entropy, softmax, etc.
├── optim/             # Adam optimizer + LR scheduling
├── utils/data/        # DataLoader, Dataset
├── tokenizer/         # Character, word-level & BPE tokenizers
└── models/GPT.py      # Transformer implementation
datagen.py             # Data preprocessing
train.py               # Training script
sample.py              # Text generation
plot.py                # Training curves (requires matplotlib)
test.py                # Test suite
```
  • Explicit gradients - see exactly how backprop works
  • PyTorch-like API - familiar interface
  • Complete transformer - multi-head attention, feedforward, layer norm
  • Flexible tokenization - character, word-level, or BPE preprocessing
  • Extensive testing - forward and backward passes are checked for correctness in every layer
  • Minimal dependencies - just NumPy and the standard library

Perfect for understanding how modern language models actually work.

resources that I found helpful


Three ways to represent text, three different models, same Shakespeare. Let's see what happens.

I trained three identical transformer models on Shakespeare; the only difference is how the text is tokenized.

| Parameter | Value |
|---|---|
| batch_size | 16 |
| block_size | 128 |
| max_iters | 8,000 |
| lr | 3e-4 |
| min_lr | 3e-5 |
| n_layer | 4 |
| n_head | 4 |
| n_embd | 256 |
| warmup_iters | 800 |
| grad_clip | 1.0 |
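
The lr, min_lr, warmup_iters, and max_iters settings map onto the warmup-plus-cosine-decay schedule shown in the training curves below. A minimal sketch of such a schedule (the exact function is my assumption, not necessarily the repo's scheduler):

```python
import math

def get_lr(it, lr=3e-4, min_lr=3e-5, warmup_iters=800, max_iters=8000):
    # Linear warmup from 0 to lr over warmup_iters steps
    if it < warmup_iters:
        return lr * (it + 1) / warmup_iters
    # After max_iters, hold at the floor
    if it > max_iters:
        return min_lr
    # Cosine decay from lr down to min_lr in between
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (lr - min_lr)
```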
"Hello" → ['H', 'e', 'l', 'l', 'o']

One character = one token.
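
A char-level tokenizer is just two lookup tables built from the unique characters in the corpus. A minimal sketch (illustrative, not the repo's tokenizer class):

```python
text = "Hello world! Hello Shakespeare."        # stand-in corpus; the repo uses Shakespeare
chars = sorted(set(text))                       # the full dataset yields 69 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
itos = {i: ch for ch, i in stoi.items()}        # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("Hello")) == "Hello"
```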

"Hello world!" → ['hello', 'world', '!']

One word = one token.

Split on whitespace and punctuation, and lowercase everything to limit out-of-vocabulary tokens (mapped to UNK).
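
A word-level tokenizer needs little more than a split rule and an UNK fallback. A minimal sketch (the regex and the `<unk>` convention are illustrative assumptions):

```python
import re

def word_tokenize(text):
    # Lowercase, then split into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

corpus_tokens = word_tokenize("Hello world! Hello Shakespeare.")
vocab = {w: i for i, w in enumerate(sorted(set(corpus_tokens)), start=1)}
vocab["<unk>"] = 0                                # id 0 reserved for unknown words

encode = lambda s: [vocab.get(w, vocab["<unk>"]) for w in word_tokenize(s)]

print(word_tokenize("Hello world!"))   # ['hello', 'world', '!']
print(encode("Hello unknown word!"))   # unseen words map to <unk>
```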

"Hello" → ['H', 'ell', 'o'] # learned subwords

BPE learns frequent character pairs from the corpus and merges them, building subwords bottom-up.
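
The core of BPE training is a loop that counts adjacent symbol pairs and merges the most frequent one. A minimal sketch of that merge loop on a toy corpus (illustrative, not the repo's BPE implementation):

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with counts
word_freqs = {tuple("hello"): 5, tuple("hell"): 3, tuple("help"): 2}

merges = []
for _ in range(3):                          # number of merges is the vocab-size knob
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)        # most frequent adjacent pair
    word_freqs = merge_pair(best, word_freqs)
    merges.append(best)

print(merges)  # [('h', 'e'), ('he', 'l'), ('hel', 'l')]
```

Encoding a new word then replays these merges in order, so frequent substrings become single tokens while rare words fall back to smaller pieces.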

| Metric | Character | Word | BPE |
|---|---|---|---|
| Final Loss | 1.5 ⭐ | 3.0 | 3.0 |
| Output Readability | ❌ (broken words) | ✅ | ⭐ |
| OOV Handling | | | |
| Semantic Coherence | | | |
| Character Names | | | |
| Natural Phrases | | | |
| Training Speed | Fast → Unstable | Steady | Slow but Stable |
| Number of chars (500 tokens) | 490 | 1602 ⭐ | 1505 |
| Number of parameters | 3.23M ⭐ | 6.55M | 6.55M |
| Embedding-related parameters | 68k (2.11%) ⭐ | 3.4M (52%) | 3.4M (52%) |

Each model comes with a 2×2 panel of plots to track training:

  • Top Left: Training and validation loss over time
  • Top Right: Gradient norm (watch for spikes = instability)
  • Bottom Left: Learning rate schedule (warmup + cosine decay)
  • Bottom Right: Validation loss improvement per eval window
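
A sketch of how such a 2×2 panel can be assembled with matplotlib, using made-up stand-in logs (illustrative, not the repo's plot.py):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in logs; in practice these come from the training loop
iters = np.arange(0, 8000, 100)
train_loss = 4.0 * np.exp(-iters / 3000) + 1.5
val_loss = train_loss + 0.1
grad_norm = 1.0 + 0.1 * np.random.rand(len(iters))
lr = np.where(iters < 800, 3e-4 * iters / 800,
              3e-5 + 0.5 * (3e-4 - 3e-5) * (1 + np.cos(np.pi * (iters - 800) / (8000 - 800))))

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
ax[0, 0].plot(iters, train_loss, label="train"); ax[0, 0].plot(iters, val_loss, label="val")
ax[0, 0].set_title("Loss"); ax[0, 0].legend()
ax[0, 1].plot(iters, grad_norm); ax[0, 1].set_title("Gradient norm")
ax[1, 0].plot(iters, lr); ax[1, 0].set_title("Learning rate")
ax[1, 1].plot(iters[1:], -np.diff(val_loss)); ax[1, 1].set_title("Val loss improvement")
plt.tight_layout(); plt.savefig("curves.png")
```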

Character Training Curves

Word Training Curves

BPE Training Curves

Asked each model to generate 500 tokens of Shakespeare:

Complete output: bpe.out

KING HENRY PERCY: And kill me Queen Margaret, and tell us not, as I have obstition? NORTHUMBERLAND: Why, then the king's son, And send him that he were,

Complete output: word.out

(Yes, I know I still have some tokenization issues...)

king to uncrown him as to the afternoon of aboard. lady anne: on a day - gone; and, for should romeo be executed in the victory!

Complete output: char.out

KINGAll, and seven dost I, And will beset no specommed a geles, and cond upon you with speaks, but ther so ent the vength
  1. Lower loss ≠ better output: The character model had the lowest loss but the worst readability; its loss is lower because it predicts 1 of 69 characters, which is much easier than predicting 1 of 6,551 words/subwords.
  2. Number of parameters: Despite sharing the same architecture configuration, the BPE and word-level models have much larger embedding matrices, which raise the parameter count from 3.23M to 6.55M (52% of it embedding-related); see the back-of-the-envelope numbers after this list.
  3. Token efficiency matters: The same 500 output tokens produced vastly different text lengths: ~500 characters with the character model, ~1500 with BPE, and ~1600 with word-level.
  4. Stability matters: BPE's consistent training beats unstable fast learning.
  5. The curse of granularity: Finer tokens (char) = easier prediction but harder composition. Coarser tokens (word) = harder prediction but natural composition.
  6. There's no free lunch: Each approach trades off different aspects.
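
Back-of-the-envelope numbers behind takeaways 1 and 2, assuming vocab sizes of 69 (char) and 6,551 (word/BPE), n_embd=256, block_size=128, a learned positional embedding, and an untied output projection (the untied head is my assumption; with it the arithmetic matches the reported counts):

```python
import math

n_embd, block_size = 256, 128
vocabs = {"char": 69, "word/bpe": 6551}

for name, V in vocabs.items():
    # Loss of a uniform guess over the vocab: -ln(1/V) = ln(V) nats
    uniform_loss = math.log(V)
    # Token embedding + positional embedding + (assumed untied) output projection
    emb_params = V * n_embd + block_size * n_embd + V * n_embd
    print(f"{name:9s} uniform loss {uniform_loss:.2f}  embedding params ~{emb_params/1e6:.2f}M")

# char      uniform loss 4.23  embedding params ~0.07M  (~68k, ~2% of 3.23M)
# word/bpe  uniform loss 8.79  embedding params ~3.39M  (~3.4M, ~52% of 6.55M)
```

Since a uniform guess already costs ln(V) nats, the character model's 1.5 and the word/BPE models' 3.0 are not directly comparable numbers.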

TODOs:

  • BPE with byte-fallback