Run Qwen3-Coder-480B-A35B Locally with Unsloth Dynamic Quants


Qwen3-Coder-480B-A35B delivers SOTA advancements in agentic coding and code tasks, matching or outperforming Claude Sonnet-4, GPT-4.1, and Kimi K2. The 480B model achieves 61.8% on Aider Polyglot and supports a 256K token context, extendable to 1M tokens.

All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.

We also uploaded Qwen3 Coder with 1M context length (extended via YaRN), as well as unquantized 8-bit and 16-bit versions.

Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:

Dynamic 2.0 GGUF (to run)

1M Context Dynamic 2.0 GGUF

🖥️ Running Qwen3 Coder

According to Qwen, these are the recommended settings for inference:

temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05

  • Temperature of 0.7

  • Top_K of 20

  • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

  • Top_P of 0.8

  • Repetition Penalty of 1.05

  • Chat template:

    <|im_start|>user
    Hey there!<|im_end|>
    <|im_start|>assistant
    What is 1+1?<|im_end|>
    <|im_start|>user
    2<|im_end|>
    <|im_start|>assistant
  • Recommended output length: 65,536 tokens (can be increased). Details here.

Chat template/prompt format with newlines un-rendered

<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n

Chat template for tool calling (getting the current temperature for San Francisco). See here for more details on how to format tool calls.

<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>
<function=get_current_temperature>
<parameter=location>
San Francisco, CA, USA
</parameter>
</function>
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>
<|im_end|>

📖 Llama.cpp: Run Qwen3 Tutorial

For Coder-480B-A35B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

If you want full or near-full precision, use our Q8_K_XL, Q8_0 or BF16 versions!

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
  2. You can use llama.cpp to download the model directly, but we normally suggest using huggingface_hub (see step 3). To use llama.cpp directly, do:

    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.7 \
        --min-p 0.0 \
        --top-p 0.8 \
        --top-k 20 \
        --repeat-penalty 1.05
  3. Or, download the model via the snippet below (after running pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
        local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
  4. Run the model in conversation mode and try any prompt.

  5. Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length (256K), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options discussed here.

./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
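If you prefer serving the model over an API, llama.cpp's OpenAI-compatible llama-server can be used with the same recommended sampling settings. A minimal sketch, assuming you also build the llama-server target; the port, prompt and per-request sampling fields (llama-server accepts top_k and repeat_penalty alongside the standard OpenAI fields) are illustrative, not from the original guide:

./llama.cpp/llama-server \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --port 8001

# In a second terminal, send a request with Qwen's recommended sampling settings:
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repeat_penalty": 1.05
    }'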

🛠️ Improving generation speed

If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

You can also customize the regex; for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up and down MoE layers, but only from the 6th layer onwards. A full command using this pattern is sketched below.
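For example, plugging that layer-range pattern into the command from the tutorial above (same shard path and sampling settings; only the -ot pattern changes):

./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05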

The latest llama.cpp release also introduces a high-throughput mode via llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also speed up generation.
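A rough sketch of the high-throughput path is below. This assumes you also build the llama-parallel target; the -np (parallel slots) and -ns (total sequences) flags follow our understanding of the llama.cpp parallel example and may change between releases, and the counts shown are arbitrary:

# Requires the llama-parallel binary; flag names may vary by llama.cpp version
./llama.cpp/llama-parallel \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 16384 \
    -np 8 \
    -ns 32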

📐 How to fit long context (256K to 1M)

To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) are listed below.

--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

Use the _1 variants (e.g. q4_1, q5_1) for somewhat better accuracy, albeit slightly slower.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
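For example, here is a sketch that quantizes both caches to q4_1 for a 256K context, assuming llama.cpp was built with -DGGML_CUDA_FA_ALL_QUANTS=ON as described above; --cache-type-v takes the same types as --cache-type-k, and the remaining flags are copied from the earlier commands:

./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 262144 \
    --flash-attn \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05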

We also uploaded 1 million context length GGUFs via YaRN scaling here.

To show how to format prompts for tool calling, let's walk through an example.

We created a Python function called get_current_temperature which should get the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius. You should change this to a real implementation!

def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,  # PRE_CONFIGURED -> you change this!
        "location": location,
        "unit": unit,
    }

Then use the tokenizer to create the entire prompt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")

messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]

prompt = tokenizer.apply_chat_template(messages, tokenize = False)

💡 Performance Benchmarks

Here are the benchmark categories reported for the 480B model, comparing Qwen3-Coder-480B-A35B-Instruct against Kimi-K2, DeepSeek-V3-0324, Claude Sonnet-4, and GPT-4.1 (scores omitted):

  • Agentic Coding: SWE-bench Verified w/ OpenHands (500 turns), SWE-bench Verified w/ OpenHands (100 turns), SWE-bench Verified w/ Private Scaffolding

  • Agentic Browser Use

  • Agentic Tool Use
