Qwen3-Coder-480B-A35B delivers SOTA advancements in agentic coding and code tasks, matching or outperforming Claude Sonnet-4, GPT-4.1, and Kimi K2. The 480B model achieves 61.8% on Aider Polyglot and supports a 256K token context, extendable to 1M tokens.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.
We also uploaded Qwen3 Coder versions with 1M context length (extended from the native 256K via YaRN), as well as unquantized 8-bit and 16-bit versions.
Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:
Dynamic 2.0 GGUF (to run)
1M Context Dynamic 2.0 GGUF
🖥️ Running Qwen3 Coder
⚙️ Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
Temperature of 0.7
Top_K of 20
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.8
Repetition Penalty of 1.05
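If you serve the model behind an OpenAI-compatible endpoint (for example llama.cpp's llama-server), these settings can be passed per request. Below is a minimal sketch; the URL, model name, and the non-standard sampler fields (top_k, min_p, repeat_penalty) are assumptions, so check your server's documentation for the exact field names.

```python
# Sketch: send Qwen's recommended sampling settings to a local OpenAI-compatible server.
# The endpoint URL, model name, and extra sampler fields are illustrative assumptions.
import requests

payload = {
    "model": "qwen3-coder",  # placeholder model id
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,             # non-standard OpenAI field; supported by some servers
    "min_p": 0.0,
    "repeat_penalty": 1.05,
}

response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```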
Chat template:
```
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
What is 1+1?<|im_end|>
<|im_start|>user
2<|im_end|>
<|im_start|>assistant
```
Recommended context output: 65,536 tokens (can be increased). Details here.
Chat template/prompt format with newlines un-rendered:
```
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
```
Chat template for tool calling (Getting the current temperature for San Francisco). More details here for how to format tool calls.
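You usually don't need to write this format by hand: the model's tokenizer can render it from a list of messages. A minimal sketch of the plain-chat case, assuming the transformers library and the official Qwen/Qwen3-Coder-480B-A35B-Instruct tokenizer (the tool-calling case is shown later in this guide):

```python
# Sketch: render the <|im_start|>...<|im_end|> chat format with the model's own template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")

messages = [
    {"role": "user", "content": "What is 1+1?"},
]

# add_generation_prompt=True appends the trailing "<|im_start|>assistant" turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```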
📖 Llama.cpp: Run Qwen3 Tutorial
For Coder-480B-A35B, we will specifically use Llama.cpp for optimized inference and its plethora of options.
If you want a full-precision (unquantized) version, use our BF16 upload; Q8_K_XL and Q8_0 are near-lossless alternatives!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
You can directly use llama.cpp to download the model but I normally suggest using huggingface_hub. To use llama.cpp directly, do:
```
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
```
Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL, or other quantized versions.
```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```
Run the model in conversation mode and try any prompt.
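If you downloaded the quant with snapshot_download, you can point llama-cli at the local shards instead of using -hf. A minimal sketch, continuing in Python via subprocess; the glob pattern and paths are illustrative and assume the build and download steps above completed:

```python
# Sketch: launch llama-cli on the locally downloaded UD-Q2_K_XL shards.
import glob
import subprocess

# Grab the first shard; llama.cpp automatically picks up the remaining split files.
model_path = sorted(glob.glob(
    "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/**/*UD-Q2_K_XL*.gguf",
    recursive = True,
))[0]

subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", model_path,
    "--threads", "-1",
    "--ctx-size", "16384",
    "--n-gpu-layers", "99",       # remove for CPU-only inference
    "-ot", ".ffn_.*_exps.=CPU",   # keep the MoE experts on the CPU
    "--temp", "0.7",
    "--min-p", "0.0",
    "--top-p", "0.8",
    "--top-k", "20",
    "--repeat-penalty", "1.05",
])
```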
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it if you are doing CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more capacity. More options are discussed here.
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
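If you are unsure what a given -ot pattern will offload, you can test the regex part against tensor names before launching. A small sketch; the tensor names are illustrative examples of llama.cpp's MoE expert weights:

```python
# Sketch: check which tensors the regex part of an -ot "regex=CPU" flag would match.
import re

# Offload gate/up/down experts from layer 6 onwards (regex from the example above).
pattern = r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps."

tensor_names = [
    "blk.3.ffn_up_exps.weight",     # early layer: not matched, stays on GPU
    "blk.6.ffn_gate_exps.weight",   # layer 6: matched, offloaded to CPU
    "blk.42.ffn_down_exps.weight",  # layer 42: matched, offloaded to CPU
]

for name in tensor_names:
    target = "CPU" if re.search(pattern, name) else "GPU"
    print(f"{name} -> {target}")
```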
The latest llama.cpp release also introduces a high-throughput mode; use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM/RAM movement, which can also make generation faster.
📐 How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM/VRAM data movement. The allowed options for K quantization (the default is f16) are listed below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
Use the _1 variants (e.g. q4_1, q5_1) for somewhat higher accuracy, though they are slightly slower.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
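Putting this together, a hedged sketch of the extra arguments for long context with a quantized KV cache, extending the subprocess example from the run section above (verify the exact flag spellings against your llama.cpp build's --help output):

```python
# Sketch: extra llama-cli arguments for long context with a quantized KV cache.
long_context_args = [
    "--ctx-size", "262144",    # 256K context window
    "--flash-attn",            # required to quantize the V cache (see compile note above)
    "--cache-type-k", "q4_1",  # _1 variants: slightly slower, somewhat more accurate
    "--cache-type-v", "q4_1",
]
# e.g. subprocess.run(base_command + long_context_args), where base_command is the
# argument list from the earlier sketch.
```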
We also uploaded 1 million context length GGUFs via YaRN scaling here.
To see how prompts are formatted for tool calling, let's walk through an example.
We create a Python function called get_current_temperature, which should return the current temperature for a location. For now it is a placeholder that always returns 21.6 degrees Celsius; you should replace it with a real implementation!
Then use the tokenizer to create the entire prompt:
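A minimal sketch of that step, assuming the transformers library and the same official Qwen/Qwen3-Coder-480B-A35B-Instruct tokenizer as earlier; the docstring wording, message contents, and placeholder return value are illustrative:

```python
# Sketch: build a tool-calling prompt with the model's chat template.
from transformers import AutoTokenizer

def get_current_temperature(location: str, unit: str = "celsius") -> float:
    """Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country".
        unit: The unit to return the temperature in (choices: ["celsius", "fahrenheit"]).
    Returns:
        The current temperature at the specified location.
    """
    return 21.6  # placeholder - replace with a real implementation!

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")

messages = [
    {"role": "user", "content": "What is the current temperature of San Francisco?"},
]

# Passing the Python function via tools= lets the template emit its schema in the prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    tools = [get_current_temperature],
    add_generation_prompt = True,
    tokenize = False,
)
print(prompt)
```

The rendered prompt contains the tool definition plus the user turn; the model's reply can then include a tool call for your code to execute.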
💡 Performance Benchmarks
Here are the benchmarks for the 480B model:
Agentic Coding
Benchmarks: SWE-bench Verified w/ OpenHands (500 turns), SWE-bench Verified w/ OpenHands (100 turns), and SWE-bench Verified w/ Private Scaffolding, comparing Qwen3-Coder-480B-A35B-Instruct against Kimi K2, DeepSeek-V3-0324, Claude Sonnet-4, and GPT-4.1.
Agentic Browser Use
Benchmarks comparing Qwen3-Coder-480B-A35B-Instruct against Kimi K2, DeepSeek-V3-0324, Claude Sonnet-4, and GPT-4.1.
Agentic Tool Use
Benchmarks comparing Qwen3-Coder-480B-A35B-Instruct against Kimi K2, DeepSeek-V3-0324, Claude Sonnet-4, and GPT-4.1.