Instructions to use steampunque/gpt-oss-20b-MP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use steampunque/gpt-oss-20b-MP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="steampunque/gpt-oss-20b-MP-GGUF", filename="gpt-oss-20b.MXFP4_H.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use steampunque/gpt-oss-20b-MP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/gpt-oss-20b-MP-GGUF # Run inference directly in the terminal: llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf steampunque/gpt-oss-20b-MP-GGUF # Run inference directly in the terminal: llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf steampunque/gpt-oss-20b-MP-GGUF # Run inference directly in the terminal: ./llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf steampunque/gpt-oss-20b-MP-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf steampunque/gpt-oss-20b-MP-GGUF
Use Docker
docker model run hf.co/steampunque/gpt-oss-20b-MP-GGUF
- LM Studio
- Jan
- Ollama
How to use steampunque/gpt-oss-20b-MP-GGUF with Ollama:
ollama run hf.co/steampunque/gpt-oss-20b-MP-GGUF
- Unsloth Studio
How to use steampunque/gpt-oss-20b-MP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for steampunque/gpt-oss-20b-MP-GGUF to start chatting
- Pi
How to use steampunque/gpt-oss-20b-MP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "steampunque/gpt-oss-20b-MP-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use steampunque/gpt-oss-20b-MP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf steampunque/gpt-oss-20b-MP-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default steampunque/gpt-oss-20b-MP-GGUF
Run Hermes
hermes
- Docker Model Runner
How to use steampunque/gpt-oss-20b-MP-GGUF with Docker Model Runner:
docker model run hf.co/steampunque/gpt-oss-20b-MP-GGUF
- Lemonade
How to use steampunque/gpt-oss-20b-MP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull steampunque/gpt-oss-20b-MP-GGUF
Run and chat with the model
lemonade run user.gpt-oss-20b-MP-GGUF-{{QUANT_TAG}}List all available models
lemonade list
Mixed Precision GGUF layer quantization of gpt-oss-20b by openai
Original model: https://huggingface.co/openai/gpt-oss-20b
WARNING: EITHER THIS MODEL or LLAMA.CPP has a major bug as of 08/07/2025. The perplexity evaluation of the model is very bad due to incorrect token probability distribution : https://github.com/ggml-org/llama.cpp/issues/15155 This problem needs to be addressed before the model can be used confidently. Most likely the bug is related to the custom swiglu with clip and/or RMS layer norms for the model being way off, resulting in output probs all very similar and low value and causing generation instability. The entire need for this hybrid quant may be related to this bug so expect the quant to be updated, or even unecessary, once the layer norm problem is resolved.
UPDATE 2/20/26 I went back and tested model again after quite some time and still find it to be unusable with greedy sampling. However, setting TEMP=1.0, TOPK=1 makes a night and day transformation and this MP quant then does what I consider to be a great job across a set of reasoning based eval promps. Thus something about always picking the greedy (highest prob) sample appears to be messing up numerical stability in the autoregressive feedback. I will investigate this effect further as I find time.
The hybrid quant employs different quantization levels on a per layer basis. For this model, the hybrid layer quant is used to help stabilize generation (as much as possible) with greedy decode to allow direct greedy decode for highest probability solutions and/or enable high probability soltuions with lower temp (such as 0.2) to be used.
For this file the layer quants are as follows:
LAYER_TYPES='[
[0 ,"MXFP4" ],[1 ,"MXFP4" ],[2 ,"Q8_0" ],[3 ,"MXFP4" ],[4 ,"MXFP4" ],[5 ,"MXFP4" ],[6 ,"MXFP4" ],[7 ,"MXFP4" ],
[8 ,"MXFP4" ],[9 ,"MXFP4" ],[10,"MXFP4" ],[11,"MXFP4" ],[12,"MXFP4" ],[13,"MXFP4" ],[14,"MXFP4" ],[15,"MXFP4" ],
[16,"MXFP4" ],[17,"MXFP4" ],[18,"MXFP4" ],[19,"MXFP4" ],[20,"MXFP4" ],[21,"MXFP4" ],[22,"MXFP4" ],[23,"Q8_0" ]
]'
FLAGS="--allow-requantize --token-embedding-type Q4_0 --output-tensor-type Q4_0 --layer-types-high"
The layer quants were optimized for stable (as possible) generation using both -ot exps=CPU (model evaluated on CPU) and full cuda offload of the model using 2 4070s and RPC. The homogenous MXFP4 quant with token embedding at Q8_0 and output tensor at Q8_0 results in the model falling into infinite repeat patterns of varying length on most generations when using greedy decode. The primary mechanism used to combat this effect is to add controlled level of nonlinearity by setting token embedding and output tensor both to Q4_0. This somewhat stabilizes both CPU decode and full cuda offload in the presence of the llama.cpp layer norm bug for the model when combined with use a specific system prompt documented below.
Comparison:
| Quant | size | PPL | Comment |
|---|---|---|---|
| MXFP4 | 12.1e9 | 459 | Q8_0 embed and output, massively unstable with greedy sampling |
| MXFP4_H | 12.4e9 | 300.5 | Q4_0 embed Q4_0 output, borderline stable with greedy sampling |
The above PPL were computed using llama-perplexity and are a red flag that something major is broke.
Usage:
This is a RL trained moe thinking model. The model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU to open up very large context space. It can also run fully offloaded on GPU via RPC or high VRAM GPU.
The model has not been tested with speculation, but is pretty fast for both CPU and GPU inference mode due to its being a moe:
| Config | non speculated gen speed |
|---|---|
| 2 4070, RPC, fully offloaded to GPU | 62 t/s |
| 1 4070, -ot exps=CPU, CPU=9900k | 18 t/s |
System prompt:
A system prompt is needed to be used with this model. The following system prompt was found to be necessary to help stop generation instability and block tool calls, along with the hybrid layer quant. The prompt defined below in shell syntax is recommend to be used, verbatim, together with the quant:
if [[ ! $EFFORT ]]; then
EFFORT=medium
fi
SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??
Reasoning: $EFFORT
Never use tool calls in any responses.
"
Further tests show this system prompt also works well combined with the hybrid quant:
SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??
Reasoning: $EFFORT
Do not use tool calls.
"
The trailing nl is signficant and makes a difference in stabilizing the output as the model appears to be right on the fringe of instability even using the hybrid layer quant. This system prompt voodoo helps kick good initial numbers into the autoregressive feedback to bootstrap the buggy metastable model into good generations which (mostly, but not always) don't go into rep loops.
For deterministic outputs do not enter the current date, leave it as ??-?? so the generation does not change when the date changes. This model will also output tool calls by default, so the system prompt is used to shut that off if the inference platform does not support the openai syntax tool calls.
ROPE:
The model uses ROPE YARN to extend context. It is known that use of ROPE with long contexts degrades inference performance. Therefore the following configuration for ROPE can be used with a context sized at 32k tokens which should be more than adequate for most problems:
--rope-scaling yarn --rope-scale 8 --yarn-orig-ctx 4096
If context <32k is used, then set rope scale to the value context_length / 4096 (example, 8192 context would be 2.0)
Long context test:
A long context problem of 85k tokens was given to the model and found to be unusably slow for both prompt processing of the 85k prompt and subsequent generation, which promptly went into a rep loop due to borderline instability of model. Llama.cpp b6100 was used for test. More info on slow processing: https://github.com/ggml-org/llama.cpp/issues/15163
Benchmarks:
Evals for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm.
Download the file from below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| gpt-oss-20b.MXFP4_H.gguf | MXFP4_H | 12.4e9 B | ~MXFP4 size |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:
- Downloads last month
- 54
We're not able to determine the quantization variants.
Model tree for steampunque/gpt-oss-20b-MP-GGUF
Base model
openai/gpt-oss-20b