Kimi K2 is the world’s most powerful open-source model, setting new SOTA performance in knowledge, reasoning, coding, and agentic tasks. The full 1T-parameter model from Moonshot AI requires 1.09TB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to just 245GB (an 80% size reduction): Kimi-K2-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run quantized LLMs with minimal accuracy loss.
⚙️ Recommended Settings
You need at least 250GB of disk space to run the 1-bit quant!
The only hard requirement is disk space + RAM + VRAM ≥ 250GB. That means you do not need that much RAM or VRAM (GPU) to run the model; it will just be slower.
The 1.8-bit (UD-TQ1_0) quant will fit on a single 24GB GPU (with all MoE layers offloaded to system RAM or a fast disk). Expect around 5 tokens/s with this setup if you also have around 256GB of system RAM. The full Kimi K2 Q8 quant is 1.09TB in size and will need at least 8x H200 GPUs.
For optimal performance you will need at least 250GB of unified memory or 250GB of combined RAM+VRAM for 5+ tokens/s. With less than 250GB of combined RAM+VRAM, the model's speed will definitely take a hit.
If you do not have 250GB of RAM+VRAM, no worries! llama.cpp supports disk offloading via mmap, so the model will still run, just more slowly: where you might otherwise get 5 to 10 tokens/s, expect under 1 token/s.
We suggest using our UD-Q2_K_XL (381GB) quant to balance size and accuracy!
For the best performance, have your combined VRAM + RAM at least match the size of the quant you're downloading. If not, it'll still work via disk offloading, just slower!
🌙 Official Recommended Settings:
According to Moonshot AI, these are the recommended settings for Kimi K2 inference:
Set the temperature to 0.6 to reduce repetition and incoherence.
We recommend setting min_p to 0.01 to suppress the occurrence of unlikely tokens with low probabilities.
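With llama.cpp, for instance, these settings map onto the standard --temp and --min-p sampling flags. A minimal sketch (the model path is a placeholder for whichever GGUF you download):

```bash
# Moonshot AI's recommended sampling settings as llama.cpp flags;
# point --model at your downloaded GGUF file
./llama.cpp/llama-cli \
    --model path/to/Kimi-K2-Instruct-GGUF.gguf \
    --temp 0.6 \
    --min-p 0.01
```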
🔢 Chat template and prompt format
Kimi Chat does use a BOS (beginning-of-sequence) token. The system, user, and assistant roles are all enclosed with <|im_middle|>, which is interesting, and each gets its own respective token: <|im_system|>, <|im_user|>, <|im_assistant|>.
To mark the conversational boundaries (you must remove each newline), we get:
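For illustration, a rendered conversation looks roughly like the following. This is a reconstruction from the tokens above; the plain-text role names between each role token and <|im_middle|> are our assumption:

```
<|im_system|>system<|im_middle|>You are a helpful assistant.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>
```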
💾 Model uploads
ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and reasoning tasks.
| MoE Bits | Type + Link | Disk Size | Details |
| --- | --- | --- | --- |
| 1.8-bit | UD-TQ1_0 | 245GB | Dynamic Unsloth quant; fits on a single 24GB GPU with MoE layers offloaded |
| 2-bit | UD-Q2_K_XL | 381GB | Recommended balance of size and accuracy |
| 8-bit | Q8 | 1.09TB | Full-size quant; needs at least 8x H200 GPUs |
We've also uploaded versions in BF16 format.
🐢 Run Kimi K2 Tutorials:
✨ Run in llama.cpp
Obtain the latest llama.cpp from GitHub here and check out PR 14654, or use our current fork, which should also work. We expect mainline llama.cpp to merge full support in the next few days! You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
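As a sketch, a typical build looks like this (assuming Debian/Ubuntu-style packages and a CUDA GPU; adjust the cmake flag as noted above):

```bash
# install build tools (Debian/Ubuntu; adjust for your distro)
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
# use -DGGML_CUDA=OFF for CPU-only inference
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```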
If you want llama.cpp to download and load the model directly, you can do the below, where :UD-IQ1_S is the quantization type. You can also download the files via Hugging Face first (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location.
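A minimal sketch of this direct-load path (the tag after the colon selects the quant; swap in another tag such as :UD-Q2_K_XL if you prefer):

```bash
# save downloads to a specific folder
export LLAMA_CACHE="unsloth/Kimi-K2-Instruct-GGUF"
# llama.cpp fetches the quant straight from Hugging Face, similar to `ollama run`
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2-Instruct-GGUF:UD-IQ1_S \
    --ctx-size 16384
```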
Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speeds. You can customize the regex to keep more layers on the GPU if you have more capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.
You can also customize the regex: for example, -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up, and down MoE layers, but only from the 6th layer onwards. A quick way to preview what such a pattern matches is sketched below.
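To preview which tensors a pattern will catch, you can test it against representative tensor names. A sketch, assuming llama.cpp's usual blk.N.ffn_*_exps naming for MoE expert tensors:

```bash
# sample MoE expert tensor names as they appear in the GGUF
printf 'blk.5.ffn_gate_exps.weight\nblk.6.ffn_up_exps.weight\nblk.12.ffn_down_exps.weight\n' |
  grep -E '\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.'
# prints the blk.6 and blk.12 names but not blk.5, i.e. layer 6 onwards
```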
Download the model (after running pip install huggingface_hub hf_transfer). You can choose UD-TQ1_0 (the dynamic 1.8-bit quant) or other quantized versions like Q2_K_XL. We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: huggingface.co/unsloth/Kimi-K2-Instruct-GGUF
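For example, using the huggingface-cli tool that ships with huggingface_hub (a sketch; the local directory name is arbitrary):

```bash
# hf_transfer speeds up the download
export HF_HUB_ENABLE_HF_TRANSFER=1
# fetch only the UD-Q2_K_XL files; change the pattern to pick another quant
huggingface-cli download unsloth/Kimi-K2-Instruct-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir Kimi-K2-Instruct-GGUF
```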
Run any prompt.
Edit --threads -1 for the number of CPU threads (by default it is set to the maximum), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading for the best performance; reduce it if your GPU runs out of memory, and remove it entirely for CPU-only inference. A full example invocation is sketched below.
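Putting these flags together, a full run might look like the following sketch (the model path, including the split-file name, is a placeholder for whichever quant you downloaded):

```bash
./llama.cpp/llama-cli \
    --model Kimi-K2-Instruct-GGUF/UD-Q2_K_XL/Kimi-K2-Instruct-UD-Q2_K_XL-00001-of-00008.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --min-p 0.01
```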
🔍Tokenizer quirks and bug fixes
The Kimi K2 tokenizer was interesting to play around with - it's mostly similar in action to GPT-4o's tokenizer! We first see in the tokenization_kimi.py file the following regular expression (regex) that Kimi K2 uses:
After careful inspection, we find Kimi K2 is nearly identical to GPT-4o's tokenizer regex which can be found in llama.cpp's source code.
Both tokenize numbers into groups of 1 to 3 digits (9, 99, 999) and use similar patterns. The only difference looks to be the handling of "Han" (Chinese) characters, which Kimi's tokenizer deals with more extensively. The PR by https://github.com/gabriellarson handles these differences well after some discussions here.
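As a quick illustration of the digit-grouping behavior, the \p{N}{1,3} fragment shared by both regexes splits a digit run greedily into chunks of at most 3, sketched here with GNU grep's PCRE mode:

```bash
echo "1234567" | grep -oP '\p{N}{1,3}'
# 123
# 456
# 7
```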
We also find the correct EOS token should not be [EOS], but rather <|im_end|>, which we have also fixed in our model conversions.
🐦 Flappy Bird + other tests
We introduced the Flappy Bird test when we released our 1.58-bit quants for DeepSeek R1. We found Kimi K2 to be one of the only models to one-shot all our tasks, including this one, the Heptagon test, and others, even at 2-bit. The goal is to ask the LLM to create a Flappy Bird game while following some specific instructions:
You can also test the dynamic quants via the Heptagon Test as per r/LocalLLaMA, which tests the model on creating a basic physics engine that simulates balls rotating inside a moving, enclosed heptagon shape.

The goal is to make the heptagon spin, and the balls in the heptagon should move. The prompt is below: