Browse Main Chapter Code
- Setup recommendations
- Ch 1: Understanding Large Language Models
- Ch 2: Working with Text Data
- Ch 3: Coding Attention Mechanisms
- Ch 4: Implementing a GPT Model from Scratch
- Ch 5: Pretraining on Unlabeled Data
- Ch 6: Finetuning for Text Classification
- Ch 7: Finetuning to Follow Instructions
- Appendix A: Introduction to PyTorch
- Appendix B: References and Further Reading
- Appendix C: Exercise Solutions
- Appendix D: Adding Bells and Whistles to the Training Loop
- Appendix E: Parameter-efficient Finetuning with LoRA
- Setup
- Chapter 2: Working with text data
- Chapter 3: Coding attention mechanisms
- Chapter 4: Implementing a GPT model from scratch
- Chapter 5: Pretraining on unlabeled data
- Alternative Weight Loading Methods
- Pretraining GPT on the Project Gutenberg Dataset
- Adding Bells and Whistles to the Training Loop
- Optimizing Hyperparameters for Pretraining
- Building a User Interface to Interact With the Pretrained LLM
- Converting GPT to Llama
- Llama 3.2 From Scratch
- Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch
- Gemma 3 From Scratch
- Memory-efficient Model Weight Loading
- Extending the Tiktoken BPE Tokenizer with New Tokens
- PyTorch Performance Tips for Faster LLM Training
- Chapter 6: Finetuning for classification
- Chapter 7: Finetuning to follow instructions
- Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries
- Evaluating Instruction Responses Using the OpenAI API and Ollama
- Generating a Dataset for Instruction Finetuning
- Improving a Dataset for Instruction Finetuning
- Generating a Preference Dataset with Llama 3.1 70B and Ollama
- Direct Preference Optimization (DPO) for LLM Alignment
- Building a User Interface to Interact With the Instruction Finetuned GPT Model
- Qwen3 (from scratch) basics
- Evaluation
Main Chapter Code
- 01_main-chapter-code contains the main chapter code.
Bonus Materials
- 02_performance-analysis contains optional code analyzing the performance of the GPT model(s) implemented in the main chapter
- 03_kv-cache implements a KV cache to speed up text generation during inference (a minimal sketch of the idea follows this list)
- ch05/07_gpt_to_llama contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loading the pretrained weights from Meta AI (it might be interesting to look at alternative architectures after completing chapter 4, but you can also save that for after reading chapter 5)
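
For orientation, here is a minimal, self-contained sketch of the KV-cache idea: during autoregressive decoding, the keys and values computed at earlier steps are stored and reused instead of being recomputed for every new token. This is an illustrative simplification, not the bonus material's implementation; the class name `KVCacheAttentionSketch` and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn


class KVCacheAttentionSketch(nn.Module):
    """Illustrative causal self-attention with a key/value cache."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.cache_k = None  # keys from previous decoding steps
        self.cache_v = None  # values from previous decoding steps

    def forward(self, x, use_cache=True):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        if use_cache:
            if self.cache_k is not None:
                # Reuse previously computed keys/values; only the new tokens'
                # projections are appended, avoiding repeated recomputation.
                k = torch.cat([self.cache_k, k], dim=2)
                v = torch.cat([self.cache_v, v], dim=2)
            self.cache_k, self.cache_v = k, v

        t_kv = k.size(2)
        # Causal mask: each new query may attend to all cached tokens plus
        # the new tokens up to (and including) its own position.
        mask = torch.triu(
            torch.ones(t, t_kv, dtype=torch.bool, device=x.device),
            diagonal=t_kv - t + 1,
        )
        scores = (q @ k.transpose(-2, -1)) / self.head_dim**0.5
        scores = scores.masked_fill(mask, float("-inf"))
        ctx = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(ctx)
```

With a cache in place, generation feeds only the newest token into the model at each step (after an initial prefill pass), while attention still sees the full history through `cache_k` and `cache_v`.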
Attention Alternatives

- 04_gqa contains an introduction to Grouped-Query Attention (GQA), which is used by most modern LLMs (Llama 4, gpt-oss, Qwen3, Gemma 3, and many more) as an alternative to regular Multi-Head Attention (MHA); a minimal sketch follows this list
- 05_mla contains an introduction to Multi-Head Latent Attention (MLA), which is used by DeepSeek V3 as an alternative to regular Multi-Head Attention (MHA); a sketch also follows below
- 06_swa contains an introduction to Sliding Window Attention (SWA), which is used by Gemma 3 and others; a sketch of the windowed mask also follows below
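
As a quick illustration of the GQA idea, the sketch below lets several query heads share one key/value head, which shrinks the key/value projections (and the KV cache). It is a simplified stand-in rather than the bonus notebook's code; the class name `GroupedQueryAttentionSketch` and the hyperparameters are placeholders.

```python
import torch
import torch.nn as nn


class GroupedQueryAttentionSketch(nn.Module):
    """Illustrative GQA: num_kv_groups key/value heads shared across query heads."""

    def __init__(self, d_model, num_heads, num_kv_groups):
        super().__init__()
        assert d_model % num_heads == 0
        assert num_heads % num_kv_groups == 0
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        self.head_dim = d_model // num_heads
        self.W_q = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.W_k = nn.Linear(d_model, num_kv_groups * self.head_dim, bias=False)
        self.W_v = nn.Linear(d_model, num_kv_groups * self.head_dim, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.num_kv_groups, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.num_kv_groups, self.head_dim).transpose(1, 2)
        # Replicate each KV head so every query head in a group sees the same K/V
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim**0.5
        scores = scores.masked_fill(mask, float("-inf"))
        ctx = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.proj(ctx)
```

With `num_kv_groups=1` this reduces to Multi-Query Attention, and with `num_kv_groups` equal to `num_heads` it falls back to standard MHA.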
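
In the same spirit, here is a heavily simplified sketch of the MLA idea: the input is down-projected to a small shared latent from which per-head keys and values are re-expanded, so a cache only needs to store the compact latent. This omits DeepSeek's decoupled rotary-embedding details, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn


class MultiHeadLatentAttentionSketch(nn.Module):
    """Illustrative MLA: keys/values reconstructed from a compressed latent."""

    def __init__(self, d_model, num_heads, kv_latent_dim):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        # Down-project the input to a compact KV latent ...
        self.W_down_kv = nn.Linear(d_model, kv_latent_dim, bias=False)
        # ... and up-project the latent back to full-size keys and values
        self.W_up_k = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.W_up_v = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        latent = self.W_down_kv(x)  # (b, t, kv_latent_dim) is all a cache would need
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_up_k(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_up_v(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim**0.5
        scores = scores.masked_fill(mask, float("-inf"))
        ctx = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.proj(ctx)
```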
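
Finally, the core of SWA is just a modified attention mask in which each token attends to at most the last `window_size` tokens instead of the full history. The helper below is a small illustrative example (the function name and window size are made up) that shows only the mask construction.

```python
import torch


def sliding_window_causal_mask(seq_len, window_size):
    # True entries mark disallowed positions (future tokens or tokens
    # farther back than the window).
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)  # query index minus key index
    return (dist < 0) | (dist >= window_size)


mask = sliding_window_causal_mask(seq_len=6, window_size=3)
print(mask.int())
# Row i allows keys j with i - 3 < j <= i; e.g., row 4 allows columns 2, 3, 4.
```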
More
In the video below, I provide a code-along session that covers some of the chapter contents as supplementary material.
[Link to the video]