Instructions to use steampunque/gpt-oss-20b-MP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use steampunque/gpt-oss-20b-MP-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="steampunque/gpt-oss-20b-MP-GGUF",
	filename="gpt-oss-20b.MXFP4_H.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use steampunque/gpt-oss-20b-MP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
# Run inference directly in the terminal:
llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
# Run inference directly in the terminal:
llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
# Run inference directly in the terminal:
./llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF

Use Docker

docker model run hf.co/steampunque/gpt-oss-20b-MP-GGUF

LM Studio
Jan
Ollama
How to use steampunque/gpt-oss-20b-MP-GGUF with Ollama:
```
ollama run hf.co/steampunque/gpt-oss-20b-MP-GGUF
```

Unsloth Studio

How to use steampunque/gpt-oss-20b-MP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting

How to use steampunque/gpt-oss-20b-MP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf steampunque/gpt-oss-20b-MP-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "steampunque/gpt-oss-20b-MP-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use steampunque/gpt-oss-20b-MP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf steampunque/gpt-oss-20b-MP-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default steampunque/gpt-oss-20b-MP-GGUF

Run Hermes

hermes

Docker Model Runner
How to use steampunque/gpt-oss-20b-MP-GGUF with Docker Model Runner:
```
docker model run hf.co/steampunque/gpt-oss-20b-MP-GGUF
```

Lemonade

How to use steampunque/gpt-oss-20b-MP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull steampunque/gpt-oss-20b-MP-GGUF

Run and chat with the model

lemonade run user.gpt-oss-20b-MP-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Mixed Precision GGUF layer quantization of gpt-oss-20b by openai

Original model: https://huggingface.co/openai/gpt-oss-20b

WARNING: EITHER THIS MODEL or LLAMA.CPP has a major bug as of 08/07/2025. The perplexity evaluation of the model is very bad due to incorrect token probability distribution : https://github.com/ggml-org/llama.cpp/issues/15155 This problem needs to be addressed before the model can be used confidently. Most likely the bug is related to the custom swiglu with clip and/or RMS layer norms for the model being way off, resulting in output probs all very similar and low value and causing generation instability. The entire need for this hybrid quant may be related to this bug so expect the quant to be updated, or even unecessary, once the layer norm problem is resolved.

UPDATE 2/20/26 I went back and tested model again after quite some time and still find it to be unusable with greedy sampling. However, setting TEMP=1.0, TOPK=1 makes a night and day transformation and this MP quant then does what I consider to be a great job across a set of reasoning based eval promps. Thus something about always picking the greedy (highest prob) sample appears to be messing up numerical stability in the autoregressive feedback. I will investigate this effect further as I find time.

The hybrid quant employs different quantization levels on a per layer basis. For this model, the hybrid layer quant is used to help stabilize generation (as much as possible) with greedy decode to allow direct greedy decode for highest probability solutions and/or enable high probability soltuions with lower temp (such as 0.2) to be used.

For this file the layer quants are as follows:

   LAYER_TYPES='[
   [0 ,"MXFP4" ],[1 ,"MXFP4" ],[2 ,"Q8_0"  ],[3 ,"MXFP4" ],[4 ,"MXFP4" ],[5 ,"MXFP4" ],[6 ,"MXFP4" ],[7 ,"MXFP4" ],
   [8 ,"MXFP4" ],[9 ,"MXFP4" ],[10,"MXFP4" ],[11,"MXFP4" ],[12,"MXFP4" ],[13,"MXFP4" ],[14,"MXFP4" ],[15,"MXFP4" ],
   [16,"MXFP4" ],[17,"MXFP4" ],[18,"MXFP4" ],[19,"MXFP4" ],[20,"MXFP4" ],[21,"MXFP4" ],[22,"MXFP4" ],[23,"Q8_0"  ]
   ]'
   FLAGS="--allow-requantize --token-embedding-type Q4_0 --output-tensor-type Q4_0 --layer-types-high"

The layer quants were optimized for stable (as possible) generation using both -ot exps=CPU (model evaluated on CPU) and full cuda offload of the model using 2 4070s and RPC. The homogenous MXFP4 quant with token embedding at Q8_0 and output tensor at Q8_0 results in the model falling into infinite repeat patterns of varying length on most generations when using greedy decode. The primary mechanism used to combat this effect is to add controlled level of nonlinearity by setting token embedding and output tensor both to Q4_0. This somewhat stabilizes both CPU decode and full cuda offload in the presence of the llama.cpp layer norm bug for the model when combined with use a specific system prompt documented below.

Comparison:

Quant	size	PPL	Comment
MXFP4	12.1e9	459	Q8_0 embed and output, massively unstable with greedy sampling
MXFP4_H	12.4e9	300.5	Q4_0 embed Q4_0 output, borderline stable with greedy sampling

The above PPL were computed using llama-perplexity and are a red flag that something major is broke.

Usage:

This is a RL trained moe thinking model. The model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU to open up very large context space. It can also run fully offloaded on GPU via RPC or high VRAM GPU.

The model has not been tested with speculation, but is pretty fast for both CPU and GPU inference mode due to its being a moe:

Config	non speculated gen speed
2 4070, RPC, fully offloaded to GPU	62 t/s
1 4070, -ot exps=CPU, CPU=9900k	18 t/s

System prompt:

A system prompt is needed to be used with this model. The following system prompt was found to be necessary to help stop generation instability and block tool calls, along with the hybrid layer quant. The prompt defined below in shell syntax is recommend to be used, verbatim, together with the quant:

if [[ ! $EFFORT ]]; then
   EFFORT=medium
fi

SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??

Reasoning: $EFFORT

Never use tool calls in any responses.
"

Further tests show this system prompt also works well combined with the hybrid quant:

SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??

Reasoning: $EFFORT

Do not use tool calls.
"

The trailing nl is signficant and makes a difference in stabilizing the output as the model appears to be right on the fringe of instability even using the hybrid layer quant. This system prompt voodoo helps kick good initial numbers into the autoregressive feedback to bootstrap the buggy metastable model into good generations which (mostly, but not always) don't go into rep loops.

For deterministic outputs do not enter the current date, leave it as ??-?? so the generation does not change when the date changes. This model will also output tool calls by default, so the system prompt is used to shut that off if the inference platform does not support the openai syntax tool calls.

ROPE:

The model uses ROPE YARN to extend context. It is known that use of ROPE with long contexts degrades inference performance. Therefore the following configuration for ROPE can be used with a context sized at 32k tokens which should be more than adequate for most problems:

--rope-scaling yarn --rope-scale 8 --yarn-orig-ctx 4096

If context <32k is used, then set rope scale to the value context_length / 4096 (example, 8192 context would be 2.0)

Long context test:

A long context problem of 85k tokens was given to the model and found to be unusably slow for both prompt processing of the 85k prompt and subsequent generation, which promptly went into a rep loop due to borderline instability of model. Llama.cpp b6100 was used for test. More info on slow processing: https://github.com/ggml-org/llama.cpp/issues/15163

Benchmarks:

Evals for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm.