Kimi K2 is the world’s most powerful open-source model, setting new SOTA performance in knowledge, reasoning, coding, and agentic tasks. The full 1T-parameter model from Moonshot AI requires 1.09TB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to just 245GB (an 80% size reduction): Kimi-K2-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run quantized LLMs with minimal accuracy loss.
⚙️ Recommended Settings
You need at least 250GB of disk space to run the 1-bit quant!
The only hard requirement is disk space + RAM + VRAM ≥ 250GB. That means you do not need that much RAM or VRAM (GPU) to run the model; it will just be slower.
The 1.8-bit (UD-TQ1_0) quant will fit on a single 24GB GPU (with all MoE layers offloaded to system RAM or a fast disk). Expect around 5 tokens/s with this setup if you also have around 256GB of system RAM. The full Kimi K2 Q8 quant is 1.09TB in size and will need at least 8x H200 GPUs.
For optimal performance you will need at least 250GB of unified memory or 250GB of combined RAM+VRAM for 5+ tokens/s. With less than 250GB of combined RAM+VRAM, the model's speed will definitely take a hit.
If you do not have 250GB of RAM+VRAM, no worries! llama.cpp supports disk offloading via mmap, so the model will still run, just more slowly: where you might otherwise get 5 to 10 tokens/s, expect under 1 token/s.
We suggest using our UD-Q2_K_XL (381GB) quant to balance size and accuracy!
For the best performance, have your combined VRAM + RAM at least match the size of the quant you're downloading. If not, it'll still work via disk offloading, just slower!
🌙 Official Recommended Settings:
According to Moonshot AI, these are the recommended settings for Kimi K2 inference:
Set the temperature to 0.6 to reduce repetition and incoherence.
We recommend setting min_p to 0.01 to suppress the occurrence of unlikely tokens with low probabilities.
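With llama.cpp, for instance, these settings map onto the standard --temp and --min-p sampling flags. A minimal sketch (the model path is a placeholder for whichever GGUF you download):

```bash
# Moonshot AI's recommended sampling settings as llama.cpp flags;
# point --model at your downloaded GGUF file
./llama.cpp/llama-cli \
    --model path/to/Kimi-K2-Instruct-GGUF.gguf \
    --temp 0.6 \
    --min-p 0.01
```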
🔢 Chat template and prompt format
Kimi Chat does use a BOS (beginning-of-sequence) token. The system, user, and assistant roles are all enclosed with <|im_middle|>, which is interesting, and each gets its own respective token: <|im_system|>, <|im_user|>, <|im_assistant|>.
To mark the conversational boundaries (you must remove each newline), we get:
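For illustration, a rendered conversation looks roughly like the following. This is a reconstruction from the tokens above; the plain-text role names between each role token and <|im_middle|> are our assumption:

```
<|im_system|>system<|im_middle|>You are a helpful assistant.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>
```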
💾 Model uploads
ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and reasoning tasks.
| MoE Bits | Type + Link | Disk Size | Details |
| --- | --- | --- | --- |
| 1.8-bit | UD-TQ1_0 | 245GB | Dynamic Unsloth quant; fits on a single 24GB GPU with MoE layers offloaded |
| 2-bit | UD-Q2_K_XL | 381GB | Recommended balance of size and accuracy |
| 8-bit | Q8 | 1.09TB | Full-size quant; needs at least 8x H200 GPUs |
We've also uploaded versions in BF16 format.
🐢 Run Kimi K2 Tutorials:
✨ Run in llama.cpp
Obtain the latest llama.cpp from GitHub here and check out PR 14654, or use our current fork, which should also work. We expect mainline llama.cpp to merge full support in the next few days! You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
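As a sketch, a typical build looks like this (assuming Debian/Ubuntu-style packages and a CUDA GPU; adjust the cmake flag as noted above):

```bash
# install build tools (Debian/Ubuntu; adjust for your distro)
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
# use -DGGML_CUDA=OFF for CPU-only inference
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```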
If you want llama.cpp to download and load the model directly, you can do the below, where :UD-IQ1_S is the quantization type. You can also download the files via Hugging Face first (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location.
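A minimal sketch of this direct-load path (the tag after the colon selects the quant; swap in another tag such as :UD-Q2_K_XL if you prefer):

```bash
# save downloads to a specific folder
export LLAMA_CACHE="unsloth/Kimi-K2-Instruct-GGUF"
# llama.cpp fetches the quant straight from Hugging Face, similar to `ollama run`
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2-Instruct-GGUF:UD-IQ1_S \
    --ctx-size 16384
```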
Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speeds. You can customize the regex to keep more layers on the GPU if you have more capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.
You can also customize the regex: for example, -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up, and down MoE layers, but only from the 6th layer onwards. A quick way to preview what such a pattern matches is sketched below.
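To preview which tensors a pattern will catch, you can test it against representative tensor names. A sketch, assuming llama.cpp's usual blk.N.ffn_*_exps naming for MoE expert tensors:

```bash
# sample MoE expert tensor names as they appear in the GGUF
printf 'blk.5.ffn_gate_exps.weight\nblk.6.ffn_up_exps.weight\nblk.12.ffn_down_exps.weight\n' |
  grep -E '\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.'
# prints the blk.6 and blk.12 names but not blk.5, i.e. layer 6 onwards
```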
Download the model (after running pip install huggingface_hub hf_transfer). You can choose UD-TQ1_0 (the dynamic 1.8-bit quant) or other quantized versions like Q2_K_XL. We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: huggingface.co/unsloth/Kimi-K2-Instruct-GGUF
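For example, using the huggingface-cli tool that ships with huggingface_hub (a sketch; the local directory name is arbitrary):

```bash
# hf_transfer speeds up the download
export HF_HUB_ENABLE_HF_TRANSFER=1
# fetch only the UD-Q2_K_XL files; change the pattern to pick another quant
huggingface-cli download unsloth/Kimi-K2-Instruct-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir Kimi-K2-Instruct-GGUF
```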
Run any prompt.
Edit --threads -1 for the number of CPU threads (by default it is set to the maximum), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading for the best performance; reduce it if your GPU runs out of memory, and remove it entirely for CPU-only inference. A full example invocation is sketched below.
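Putting these flags together, a full run might look like the following sketch (the model path, including the split-file name, is a placeholder for whichever quant you downloaded):

```bash
./llama.cpp/llama-cli \
    --model Kimi-K2-Instruct-GGUF/UD-Q2_K_XL/Kimi-K2-Instruct-UD-Q2_K_XL-00001-of-00008.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --min-p 0.01
```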
🔍Tokenizer quirks and bug fixes
The Kimi K2 tokenizer was interesting to play around with - it's mostly similar in action to GPT-4o's tokenizer! We first see in the tokenization_kimi.py file the following regular expression (regex) that Kimi K2 uses:
After careful inspection, we find Kimi K2 is nearly identical to GPT-4o's tokenizer regex which can be found in llama.cpp's source code.
Both tokenize numbers into groups of 1 to 3 digits (9, 99, 999) and use similar patterns. The only difference looks to be the handling of "Han" (Chinese) characters, which Kimi's tokenizer deals with more extensively. The PR by https://github.com/gabriellarson handles these differences well after some discussions here.
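As a quick illustration of the digit-grouping behavior, the \p{N}{1,3} fragment shared by both regexes splits a digit run greedily into chunks of at most 3, sketched here with GNU grep's PCRE mode:

```bash
echo "1234567" | grep -oP '\p{N}{1,3}'
# 123
# 456
# 7
```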
We also find the correct EOS token should not be [EOS], but rather <|im_end|>, which we have also fixed in our model conversions.
🐦 Flappy Bird + other tests
We introduced the Flappy Bird test when we released our 1.58-bit quants for DeepSeek R1. We found Kimi K2 to be one of the only models to one-shot all our tasks, including this one, the Heptagon test, and others, even at 2-bit. The goal is to ask the LLM to create a Flappy Bird game while following some specific instructions:
You can also test the dynamic quants via the Heptagon Test as per r/LocalLLaMA, which tests the model on creating a basic physics engine that simulates balls rotating inside a moving, enclosed heptagon shape.

The goal is to make the heptagon spin, and the balls in the heptagon should move. The prompt is below: