Instructions to use mpe74/gemma-4-31B-it-oQ6 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mpe74/gemma-4-31B-it-oQ6 with MLX:
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("mpe74/gemma-4-31B-it-oQ6")
config = load_config("mpe74/gemma-4-31B-it-oQ6")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use mpe74/gemma-4-31B-it-oQ6 with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mpe74/gemma-4-31B-it-oQ6"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "mpe74/gemma-4-31B-it-oQ6" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use mpe74/gemma-4-31B-it-oQ6 with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mpe74/gemma-4-31B-it-oQ6"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mpe74/gemma-4-31B-it-oQ6
Run Hermes
hermes
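Both agents above talk to the same local OpenAI-compatible endpoint, so the server can be checked directly before an agent is wired up. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at the base URL used in the Pi config and exposes the usual `/chat/completions` route; `build_chat_request` and `ask` are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: base URL and model id copied from the Pi/Hermes configuration above.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "mpe74/gemma-4-31B-it-oQ6"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Say hello in one sentence."))` should return a short completion. A connection error usually means the server has not finished loading the model or is listening on a different port.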
gemma-4-31B-it-oQ6
An oQ6 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.
The output is standard MLX safetensors, compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.
Key Facts
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31b-it (31B dense, BF16) |
| Quantization | oQ6 — sensitivity-driven mixed-precision |
| Effective bpw | ~6.5 |
| Model Size | ~25 GB (vs. 58.3 GB BF16) |
| Vision | ✅ Preserved (vision weights kept in fp16) |
| Format | Standard MLX safetensors |
| Quantized with | oMLX v0.3.4+ |
| Hardware | Apple M2 Ultra 128 GB |
Why oQ6?
oQ6 fills the gap between the existing oQ4 (18 GB) and oQ8 (31 GB) variants. At ~25 GB, it fits comfortably on 32 GB MacBooks with room for KV cache, while maintaining high quality with vision capabilities intact.
oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. At 6-bit, quality remains very close to BF16 and is significantly better than 4-bit, while token generation runs about 9% faster than oQ8 (19.1 vs. 17.5 tok/s in the benchmarks below).
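To make the idea of "allocating bits where they matter most" concrete, here is a toy greedy allocator. It is a sketch only and is not oMLX's actual algorithm: the sensitivity scores, layer sizes, and the `allocate_bits` helper are all hypothetical, and it ignores per-group scale and bias overhead.

```python
def allocate_bits(sensitivity, sizes, budget_bpw, choices=(4, 6, 8)):
    """Greedy mixed-precision allocation (illustrative, not oMLX's method).

    sensitivity: hypothetical per-layer scores, higher means more fragile
    sizes:       number of weights in each layer
    budget_bpw:  target average bits per weight over all layers
    Returns (bits per layer, achieved average bits per weight).
    """
    total = sum(sizes)
    budget_bits = budget_bpw * total
    # Start every layer at the lowest bit width.
    bits = [min(choices)] * len(sizes)
    used = sum(b * n for b, n in zip(bits, sizes))
    # Visit layers from most to least sensitive and give each one
    # the widest format that still fits in the remaining budget.
    order = sorted(range(len(sizes)), key=lambda i: sensitivity[i], reverse=True)
    for i in order:
        for b in sorted(choices, reverse=True):
            extra = (b - bits[i]) * sizes[i]
            if extra > 0 and used + extra <= budget_bits:
                bits[i] = b
                used += extra
                break
    return bits, used / total
```

For three equally sized layers with scores `[0.9, 0.1, 0.5]` and a 6-bit budget, the most sensitive layer ends up at 8 bits, the middle one at 6, and the least sensitive stays at 4, which averages to the 6-bit target.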
Benchmarks
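Tested on Apple M2 Ultra (128 GB, 76 GPU cores) with oMLX. Generation length: 128 tokens. In the tables, pp is prompt processing (prefill), tg is token generation, TTFT is time to first token, and TPOT is time per output token.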
oQ6 (this model, 25.2 GB)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 5,918 | 52.8 | 173.0 tok/s | 19.1 tok/s | 12.6 | 91.3 tok/s | 25.23 GB |
| pp4096/tg128 | 23,220 | 58.7 | 176.4 tok/s | 17.2 tok/s | 30.7 | 137.7 tok/s | 27.08 GB |
| pp8192/tg128 | 46,695 | 69.9 | 175.4 tok/s | 14.4 tok/s | 55.6 | 149.7 tok/s | 27.39 GB |
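The columns in this table are related, and the sketch below shows one set of relations that reproduces the reported figures to the displayed precision. These formulas are inferred from the numbers themselves, not taken from oMLX documentation. In particular, treating the first generated token as covered by TTFT is an assumption, and `derived_metrics` is an illustrative helper name.

```python
def derived_metrics(prompt_tokens, gen_tokens, ttft_ms, tpot_ms):
    """Recompute pp TPS, tg TPS, E2E and throughput from TTFT and TPOT."""
    ttft_s = ttft_ms / 1000
    # Assumption: the first output token arrives at TTFT, so only
    # (gen_tokens - 1) inter-token intervals remain afterwards.
    decode_s = (gen_tokens - 1) * tpot_ms / 1000
    e2e_s = ttft_s + decode_s
    return {
        "pp_tps": prompt_tokens / ttft_s,
        "tg_tps": gen_tokens / decode_s,
        "e2e_s": e2e_s,
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,
    }


row = derived_metrics(prompt_tokens=1024, gen_tokens=128, ttft_ms=5918, tpot_ms=52.8)
```

This also explains why throughput rises with longer prompts even though tg TPS falls: prefill runs at roughly 175 tok/s regardless of length, so it dominates the combined figure as the prompt grows.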
Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 19.1 tok/s | 1.00x | 173.0 tok/s | 173.0 tok/s | 5,918 | 12.6 |
| 2x | 25.9 tok/s | 1.36x | 172.4 tok/s | 86.2 tok/s | 11,669 | 21.7 |
| 4x | 30.3 tok/s | 1.59x | 172.3 tok/s | 43.1 tok/s | 23,032 | 40.7 |
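The Speedup and pp TPS/req columns follow directly from the others, as the short sketch below shows. It also derives a per-request generation rate, which is not in the table and assumes the reported tg TPS is the aggregate across all requests in the batch. `batching_view` is an illustrative helper name.

```python
def batching_view(batch, tg_tps, pp_tps, baseline_tg_tps=19.1):
    """Derive the per-batch columns from aggregate tg TPS and pp TPS."""
    return {
        "speedup": tg_tps / baseline_tg_tps,
        "pp_tps_per_req": pp_tps / batch,
        # Assumption: tg_tps is summed over the batch, so each request sees a share of it.
        "tg_tps_per_req": tg_tps / batch,
    }


two = batching_view(batch=2, tg_tps=25.9, pp_tps=172.4)
four = batching_view(batch=4, tg_tps=30.3, pp_tps=172.3)
```

Under that assumption, batching trades per-request speed for total output: at 4x each request generates at roughly 7.6 tok/s, while the server as a whole produces 59% more tokens per second than at 1x.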
Comparison: oQ6 vs oQ8 vs BF16
| Metric | oQ6 | oQ8 | BF16 |
|---|---|---|---|
| Size | 25 GB | 31.4 GB | 58.3 GB |
| Token Generation | 19.1 tok/s | 17.5 tok/s | 10.3 tok/s |
| Prefill | 173 tok/s | 177 tok/s | 258 tok/s |
| Peak Memory | 25.2 GB | 31.8 GB | 58.5 GB |
| Size vs. BF16 | -57% | -46% | baseline |
| Generation speed vs. BF16 | +85% | +70% | baseline |
oQ6 is the sweet spot for 32 GB Macs: it fits with room for KV cache, generates about 9% faster than oQ8, and keeps quality close to BF16.
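The relative figures in the comparison can be reproduced from the absolute rows with a single helper. This is a small sketch, and `pct_vs` is an illustrative name. The same calculation on the Prefill row shows that the speed advantage applies to token generation only.

```python
def pct_vs(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value / baseline - 1) * 100


size_oq6 = round(pct_vs(25, 58.3))          # size, oQ6 vs. BF16
gen_oq6 = round(pct_vs(19.1, 10.3))         # generation speed, oQ6 vs. BF16
gen_oq6_vs_oq8 = round(pct_vs(19.1, 17.5))  # generation speed, oQ6 vs. oQ8
prefill_oq6 = round(pct_vs(173, 258))       # prefill speed, oQ6 vs. BF16
```

On these numbers, oQ6 prefill is about a third slower than BF16, so workloads dominated by long prompts and short outputs benefit less than the generation figure suggests.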
Usage
oMLX
Drop the model folder into your oMLX models directory. It is auto-detected when the server starts.
mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("mpe74/gemma-4-31B-it-oQ6")
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
Recommended Sampling Parameters
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
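The mlx-lm snippet above does not set these values, so generation falls back to the library defaults. One way to apply them is sketched below. It assumes a recent mlx-lm release in which `mlx_lm.sample_utils.make_sampler` accepts `temp`, `top_p`, and `top_k`, and in which `generate` accepts a `sampler` argument; older releases may differ. `generate_with_recommended_sampling` is an illustrative helper name.

```python
# Recommended values from the table above, keyed by make_sampler's argument names.
RECOMMENDED_SAMPLING = {"temp": 1.0, "top_p": 0.95, "top_k": 64}


def generate_with_recommended_sampling(user_prompt: str, max_tokens: int = 512) -> str:
    """Load the model and generate one reply using the recommended sampler."""
    # Imported here so this module can be imported on machines without MLX.
    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler

    model, tokenizer = load("mpe74/gemma-4-31B-it-oQ6")
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    sampler = make_sampler(**RECOMMENDED_SAMPLING)
    return generate(
        model, tokenizer, prompt=prompt, sampler=sampler, max_tokens=max_tokens
    )
```

Calling `generate_with_recommended_sampling("Your prompt here")` on Apple Silicon downloads the weights on first use. In a long-running program, load the model once and reuse it rather than reloading on every call.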
LM Studio
Search for the model in LM Studio and download it. It runs on the MLX backend on Apple Silicon.
Quantization Details
| Parameter | Value |
|---|---|
| oQ Level | oQ6 |
| Effective bpw | ~6.5 |
| Mode | Affine quantization |
| Group size | 64 |
| Sensitivity model | Source model (google/gemma-4-31b-it BF16) |
| Calibration data | Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning) |
| Vision weights | Preserved in fp16 |
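For readers unfamiliar with the terms in this table, the sketch below shows group-wise affine quantization in plain Python: each group of weights is stored as small integers plus one scale and one bias, so that `w ≈ scale * q + bias`. It illustrates the general technique, not oMLX's or MLX's implementation. The storage estimate assumes one 16-bit scale and one 16-bit bias per group, which is consistent with a uniform 6-bit layer at group size 64 costing 6.5 bits per weight.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group to integers in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    if scale == 0:
        scale = 1.0  # constant group: any scale works, every q is 0
    q = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo  # lo acts as the bias (offset)


def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integers, scale and bias."""
    return [scale * v + bias for v in q]


def bits_per_weight(bits, group_size, meta_bits=32):
    """Storage per weight, assuming meta_bits of scale and bias per group."""
    return bits + meta_bits / group_size


group = [((i * 37) % 64) / 64 - 0.5 for i in range(64)]  # one group of 64 toy weights
q, scale, bias = quantize_group(group, bits=6)
restored = dequantize_group(q, scale, bias)
```

The reconstruction error of each weight is bounded by half a quantization step (`scale / 2`), which is why smaller groups or more bits improve fidelity at the cost of storage.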
Also Available
- mpe74/gemma-4-31B-it-oQ8 — Near-lossless 8-bit (~31 GB) for 64+ GB Macs
Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).