This is a CPU-only inference implementation for the DeepSeek family of large language models written in C++, based on yalm (Yet Another Language Model).
For fun and learning!
I was initially adding DeepSeek support to yalm but realized the changes were large and complex enough that they might ruin the simplicity of that project. Maybe at some point I'll upstream the changes, but for now I've decided to fork them into a separate, smaller, leaner codebase.
Since this program only supports DeepSeek, it's tiny compared to other inference engines (<2k LOC not including fmt and json, vs. >250k for llama.cpp and vllm) and is extra hackable. I'm currently using it as a testbed to study single-batch DeepSeek decoding performance on CPU.
Quantizations other than FP32 require AVX2 and F16C support.
DeepSeek-V2-Lite | ✅ | ✅ | WIP | ✅ | WIP | ✅ | WIP | ✅ |
DeepSeek-V2 | ✅ | ✅ | WIP | ✅ | WIP | ✅ | WIP | ✅ |
DeepSeek-V2.5 | ✅ | ✅ | WIP | ✅ | WIP | ✅ | WIP | ✅ |
DeepSeek-V3 | ✅ | ✅ | WIP | ✅ | WIP | - | - | - |
DeepSeek-R1 | ✅ | ✅ | WIP | ✅ | WIP | - | - | - |
deepseek.cpp is missing important optimizations for production use (see notes below), but gets pretty close to llama.cpp in single-batch decode speed. Benchmarking DeepSeek-V3-Base with Q2_K quantization on an AWS r6a.12xlarge instance (AMD EPYC 7R13, 2x24 cores, 384GB DDR4 RAM):
- llama.cpp (DeepSeek-V3-Q2_K_XS 207GB, tg128, best of 16/24/32/48 threads): 4.57 tok/s
- deepseek.cpp (Q2_K 207GB, MHA, -n 128 -L completion with 16 threads): 4.02 tok/s
A big part of this is that deepseek.cpp uses the llama.cpp vec_dot kernels for Q2_K, so I can't claim to have matched its performance purely through my own ingenuity. But it is still surprising given that the inference code is much simpler, opting for OpenMP over a global threadpool with spinlock kernel barriers. I'm hoping that in addition to serving as a testbed for myself, this gives a good base for others to hack on.
deepseek.cpp requires a computer with a C++20-compatible compiler. You'll also need a directory containing LLM safetensor weights and configuration files in huggingface format, which you'll convert into a directory of .dseek files containing the converted weights. Follow the steps below to download DeepSeek-V2-Lite, build deepseek.cpp, and run it:
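A rough sketch of the workflow is below. The huggingface-cli download step is standard, but the build command and converter invocation are placeholders (assumptions on my part); substitute the actual commands from this repository:

```sh
# Download DeepSeek-V2-Lite weights and config in huggingface format
# (huggingface-cli comes from the huggingface_hub package).
huggingface-cli download deepseek-ai/DeepSeek-V2-Lite --local-dir DeepSeek-V2-Lite

# Build deepseek.cpp (assumes a CMake build that produces ./build/main).
cmake -B build && cmake --build build -j

# Convert the safetensor weights into a directory of .dseek files.
# "convert.py" and its arguments are placeholders for this repo's actual converter.
python convert.py v2-lite DeepSeek-V2-Lite

# Run the converted model; see the CLI help below for prompt and sampling flags.
./build/main v2-lite
```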
See the CLI help documentation below for ./build/main:
You will likely need to tune the number of OpenMP threads to achieve good performance. For example:
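One way is to set the standard OMP_NUM_THREADS environment variable (the weight directory below is a placeholder for your converted model):

```sh
# On a 48-core machine, run with 24 threads (half the core count, per the heuristic below).
OMP_NUM_THREADS=24 ./build/main v2-lite
```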
The default OpenMP thread count can result in severely degraded throughput, likely due to thread contention. I have found a good heuristic to be half the number of cores.
- --quant=f8e5m2 specifies model weight quantization using 128x128 blocks. MoE gates and layer norms are left in full precision. This should provide better accuracy than per-tensor quantization or the naive truncating quantization done by yalm (which results in nonsensical output for the DeepSeek family of models).
- --quant=q2_k and --quant=q3_k specify model weight quantization using the 2-bit and 3-bit llama.cpp K-quantization schemes, which use a two-level hierarchy of blocks and super-blocks to store scales/biases for ranges of weights.
- The models have a tendency to repeat themselves and get into infinite loops at lower temperatures. In my testing, a temperature of ~1.0 avoids this failure mode but also keeps the models reasonably grounded.
- Some new, optional architectural features of DeepSeek V3 (e.g. the noaux_tc method of expert selection) have not yet been implemented, so model accuracy may be lower than that of the reference model.
- You will need ~650GB of memory to run DeepSeek V3 in F8E5M2, or 206GB for 2-bit Q2_K. For best performance, ensure there is enough physical RAM available and run as sudo with -L to force the weights to stay in RAM (see the example after these notes). Otherwise, most operating systems will automatically supplement RAM with swap space (keeping some memory on disk), at the cost of severely degraded token throughput. More aggressive quantization methods such as 1.58-bit are planned.
- Only decoding (i.e. incremental, iterative generation or reading of one token at a time) has been implemented. Prefills (reading a batch of prompt tokens in a single pass) have not been implemented, nor have prefill-based optimizations for the decoding phase such as speculative decoding or multi-token prediction. Finally, the current multi-latent attention implementation is still slower than multi-head attention in surprising scenarios (#8) and appears to be under-utilizing memory bandwidth. I have limited time to implement these optimizations as this is a side project for me, but PRs are welcome!
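For reference, a hedged sketch of the memory-locking invocation mentioned in the notes above (the weight directory name is a placeholder, and only the -n and -L flags are taken from this README):

```sh
# Pin the converted weights in physical RAM (-L) so decoding never hits swap;
# sudo is typically needed to raise the locked-memory (mlock) limit.
# "v3-q2k" is a placeholder for a Q2_K-converted DeepSeek-V3 weight directory;
# combine with the OMP_NUM_THREADS tuning shown earlier as needed.
sudo ./build/main v3-q2k -n 128 -L
```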