Unsloth: GPT-OSS

OpenAI has released gpt-oss-120b and gpt-oss-20b, two state-of-the-art open-weight language models under the Apache 2.0 license. Both models outperform similarly sized open models in reasoning, tool use, and few-shot tasks, while running efficiently on consumer hardware.

Trained with reinforcement learning (RL) and insights from advanced OpenAI models, gpt-oss-120b rivals o4-mini on reasoning and runs on a single 80 GB GPU. gpt-oss-20b matches o3-mini on benchmarks and fits on 16 GB of memory. Both models excel at function calling and CoT reasoning, surpassing some proprietary models like o1 and GPT-4o.


🖥️ Running gpt-oss

Below are guides for the 20B and 120B variants of the model.

OpenAI recommends these inference settings for both models:

  • Temperature = 0.6

  • Top_K = 0

  • Top_P = 1.0

Run gpt-oss-20b

To achieve inference speeds of 6+ tokens per second for our Dynamic 4-bit quant, we recommend at least 14GB of unified memory (combined VRAM and RAM) or 14GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you're using. GGUF Link: unsloth/gpt-oss-20b-GGUF

NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

🦙 Ollama: gpt-oss-20b Tutorial

  1. Install ollama if you haven't already! You can only run models up to 32B in size.

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
  2. Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file in our Hugging Face upload!

ollama run hf.co/unsloth/gpt-oss-20b-GGUF
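
Once the model is running, Ollama also serves a local HTTP API (http://localhost:11434 by default) that you can call from code. The snippet below is a minimal sketch using Python and the requests package; the default port, the model tag, and the prompt are assumptions on our part, so adjust them to your setup:

# Minimal sketch: query the local Ollama server (default port 11434).
# Assumes `ollama run hf.co/unsloth/gpt-oss-20b-GGUF` was run as above, and that
# `requests` is installed (pip install requests).
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json = {
        # model name as registered by the `ollama run` command above; check `ollama list` if it differs
        "model": "hf.co/unsloth/gpt-oss-20b-GGUF",
        "messages": [{"role": "user", "content": "Explain the difference between VRAM and unified memory."}],
        "stream": False,
        # OpenAI's recommended sampling settings from above
        "options": {"temperature": 0.6, "top_p": 1.0, "top_k": 0},
    },
)
print(response.json()["message"]["content"])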

📖 Llama.cpp: Run gpt-oss-20b Tutorial

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
  2. You can directly pull from Hugging Face via:

    ./llama.cpp/llama-cli \
        -hf unsloth/gpt-oss-20b-GGUF:F16 \
        --jinja -ngl 99 --threads -1 --ctx-size 32684 \
        --temp 0.6 --top-p 1.0 --top-k 0
  3. Or, download the model via huggingface_hub (after installing it with pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gpt-oss-20b-GGUF",
    local_dir = "unsloth/gpt-oss-20b-GGUF",
    allow_patterns = ["*F16*"],
)
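
If you prefer to query the model over HTTP rather than through the interactive CLI, llama.cpp also ships llama-server, which exposes an OpenAI-compatible endpoint. Below is a minimal client sketch in Python; the server launch command in the comment, the port, and the prompt are illustrative assumptions rather than part of the official instructions:

# Minimal sketch: call a local llama-server instance through its OpenAI-compatible API.
# Assumes llama-server was also built (add it to the --target list above) and started with
# something like: ./llama.cpp/llama-server -hf unsloth/gpt-oss-20b-GGUF:F16 --jinja -ngl 99 --port 8080
# (the flags and port here are illustrative; adjust them to your setup)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json = {
        "model": "gpt-oss-20b",  # llama-server serves one model; this field is informational
        "messages": [{"role": "user", "content": "Summarise the Apache 2.0 license in two sentences."}],
        "temperature": 0.6,
        "top_p": 1.0,
        "top_k": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])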

Run gpt-oss-120b

To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 40GB of unified memory (combined VRAM and RAM) or 40GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you're using. E.g. the Q2_K_XL quant, which is 40GB, will require at least 40GB of unified memory (VRAM + RAM) or 40GB of RAM for optimal performance. GGUF Link: unsloth/gpt-oss-120b-GGUF

NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

📖 Llama.cpp: Run gpt-oss-120b Tutorial

For gpt-oss-120b, we will specifically use Llama.cpp for optimized inference.

If you want a full precision unquantized version, use our F16 versions!

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
  2. You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:

    ./llama.cpp/llama-cli \
        -hf unsloth/gpt-oss-120b-GGUF:F16 \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.6 \
        --min-p 0.0 \
        --top-p 1.0 \
        --top-k 0
  3. Or, download the model via huggingface_hub (after installing it with pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/gpt-oss-120b-GGUF",
        local_dir = "unsloth/gpt-oss-120b-GGUF",
        allow_patterns = ["*F16*"],
    )
  4. Run the model in conversation mode and try any prompt.

  5. Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting --n-gpu-layers if your GPU runs out of memory, and remove it for CPU-only inference.

Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers if you have more GPU capacity. More options are discussed here.

./llama.cpp/llama-cli \
    --model unsloth/gpt-oss-120b-GGUF/F16/ \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 1.0 \
    --top-k 0

🛠️ Improving generation speed

If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
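
If you would rather generate such layer-range patterns programmatically than write the alternation by hand, a small illustrative helper could look like the sketch below. The function name moe_offload_regex is our own hypothetical example, not part of llama.cpp:

# Illustrative helper: build an -ot pattern that sends gate/up/down MoE expert tensors
# to the CPU, but only from layer `start` onwards. Indices past the model's real layer
# count are harmless: they simply never match any tensor name.
def moe_offload_regex(start: int, max_layer: int = 99) -> str:
    layers = "|".join(str(i) for i in range(start, max_layer + 1))
    return rf"\.({layers})\.ffn_(gate|up|down)_exps.=CPU"

# e.g. offload expert layers from the 6th layer onwards, as in the example above:
print(moe_offload_regex(6))

The printed pattern can then be passed to -ot in the llama-cli commands above.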

The latest llama.cpp release also introduces a high-throughput mode: use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make generation faster.

📐 How to fit long context (256K to 1M)

To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the following:

--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, albeit slightly slower.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.

We also uploaded 1 million context length GGUFs via YaRN scaling here.

🦥 Fine-tuning gpt-oss with Unsloth

Unsloth makes gpt-oss fine-tuning 2x faster, uses 70% less VRAM and supports 8x longer context lengths. gpt-oss-20b fits comfortably in a Google Colab 16GB VRAM Tesla T4 GPU. We're working hard to further improve support for these models in Unsloth!

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
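
After updating, a typical LoRA fine-tune follows Unsloth's usual FastLanguageModel workflow. The sketch below is illustrative only and assumes the standard Unsloth API carries over to gpt-oss; the sequence length, LoRA rank and target modules are placeholder values, so check the official gpt-oss notebook for the recommended settings:

# Minimal LoRA fine-tuning sketch using Unsloth's standard API (settings are illustrative).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    load_in_4bit = True,   # 4-bit loading to fit in ~16GB of VRAM
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    lora_alpha = 16,
    # Common defaults for dense models; gpt-oss is MoE, so the official notebook may use a different set.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)

# From here, pass `model` and `tokenizer` to TRL's SFTTrainer together with your dataset,
# as in Unsloth's other fine-tuning notebooks.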