gemma-4-31B-it-oQ6

An oQ6 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.

The output is standard MLX safetensors, compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.

Key Facts

| Property | Value |
| --- | --- |
| Base Model | google/gemma-4-31b-it (31B dense, BF16) |
| Quantization | oQ6 (sensitivity-driven mixed precision) |
| Effective bpw | 6.5 |
| Model Size | ~25 GB (vs. 58.3 GB BF16) |
| Vision | ✅ Preserved (vision weights kept in fp16) |
| Format | Standard MLX safetensors |
| Quantized with | oMLX v0.3.4+ |
| Hardware | Apple M2 Ultra 128 GB |

Why oQ6?

oQ6 fills the gap between the existing oQ4 (18 GB) and oQ8 (31 GB) variants. At ~25 GB, it fits on 32 GB MacBooks with room left for the KV cache (peak memory in the benchmarks below was 25.2 to 27.4 GB for 1K to 8K-token prompts), while maintaining high quality and keeping vision capabilities intact.

oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. At 6 bits, quality stays very close to BF16 and is significantly better than at 4 bits, while token generation is about 9% faster than oQ8 (19.1 vs. 17.5 tok/s in the comparison below).
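The idea of spending bits where they matter most can be illustrated with a small greedy allocator. This is only a sketch of the general technique, not oMLX's actual algorithm: the layer names, parameter counts, sensitivity scores, candidate bit widths, and the `allocate_bits` helper are all hypothetical.

```python
def allocate_bits(layers, budget_bpw, choices=(4, 6, 8)):
    """Greedy mixed-precision allocation (illustrative only).

    layers: dict mapping name -> (num_params, sensitivity); a higher
            sensitivity means the layer is hurt more by quantization.
    budget_bpw: target average bits per weight across all layers.
    """
    widths = sorted(choices)
    total_params = sum(n for n, _ in layers.values())
    budget_bits = budget_bpw * total_params

    # Start every layer at the lowest precision.
    bits = {name: widths[0] for name in layers}
    used = widths[0] * total_params

    # Visit layers from most to least sensitive and give each one the
    # widest bit width that still fits in the remaining budget.
    for name in sorted(layers, key=lambda k: layers[k][1], reverse=True):
        n_params = layers[name][0]
        for width in reversed(widths):
            extra = (width - bits[name]) * n_params
            if extra > 0 and used + extra <= budget_bits:
                bits[name] = width
                used += extra
                break
    return bits, used / total_params


# Hypothetical layers: (parameter count, sensitivity score).
layers = {
    "embed": (100, 0.9),
    "attn.0": (200, 0.7),
    "mlp.0": (400, 0.2),
    "mlp.1": (300, 0.1),
}
bits, avg_bpw = allocate_bits(layers, budget_bpw=6.0)
print(bits, avg_bpw)
# {'embed': 8, 'attn.0': 8, 'mlp.0': 6, 'mlp.1': 4} 6.0
```

With a 6.0 bpw budget, the two most sensitive toy layers get 8 bits, the next gets 6, and the least sensitive stays at 4, which averages out to the target.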

Benchmarks

Tested on Apple M2 Ultra (128 GB, 76 GPU cores) with oMLX. Generation length: 128 tokens.

oQ6 (this model, 25.2 GB)

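Legend: pp = prompt processing (prefill), tg = token generation, TTFT = time to first token, TPOT = time per output token, E2E = end-to-end time.

| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pp1024/tg128 | 5,918 | 52.8 | 173.0 tok/s | 19.1 tok/s | 12.6 | 91.3 tok/s | 25.23 GB |
| pp4096/tg128 | 23,220 | 58.7 | 176.4 tok/s | 17.2 tok/s | 30.7 | 137.7 tok/s | 27.08 GB |
| pp8192/tg128 | 46,695 | 69.9 | 175.4 tok/s | 14.4 tok/s | 55.6 | 149.7 tok/s | 27.39 GB |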
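The columns in this table are related by simple arithmetic. The sketch below recomputes the derived columns from TTFT and TPOT. The formulas are inferred from the reported numbers rather than taken from oMLX documentation, and `derive_metrics` is a hypothetical helper.

```python
def derive_metrics(prompt_tokens, gen_tokens, ttft_ms, tpot_ms):
    """Recompute derived benchmark columns from TTFT and TPOT.

    Assumes TPOT is averaged over the tokens after the first one, so
    decode time is (gen_tokens - 1) * TPOT. That is an assumption which
    happens to match the reported rows.
    """
    ttft_s = ttft_ms / 1000.0
    decode_s = (gen_tokens - 1) * tpot_ms / 1000.0
    e2e_s = ttft_s + decode_s
    return {
        "pp_tps": prompt_tokens / ttft_s,        # prefill speed
        "tg_tps": gen_tokens / decode_s,         # generation speed
        "e2e_s": e2e_s,                          # end-to-end time
        "throughput_tps": (prompt_tokens + gen_tokens) / e2e_s,
    }


# First row of the table: pp1024/tg128.
row = derive_metrics(1024, 128, ttft_ms=5918, tpot_ms=52.8)
for name, value in row.items():
    print(f"{name}: {value:.1f}")
```

For the pp1024/tg128 row this reproduces the reported 173.0 tok/s prefill, 19.1 tok/s generation, 12.6 s end-to-end time, and 91.3 tok/s throughput to within rounding.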

Continuous Batching (pp1024/tg128)

| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1x (baseline) | 19.1 tok/s | 1.00x | 173.0 tok/s | 173.0 tok/s | 5,918 | 12.6 |
| 2x | 25.9 tok/s | 1.36x | 172.4 tok/s | 86.2 tok/s | 11,669 | 21.7 |
| 4x | 30.3 tok/s | 1.59x | 172.3 tok/s | 43.1 tok/s | 23,032 | 40.7 |
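The Speedup and pp TPS/req columns follow directly from the other columns. As a small sketch (the `batch_summary` helper is hypothetical), aggregate generation speed is divided by the single-request baseline, and throughput is split evenly among concurrent requests. The per-request generation rate is a derived figure that assumes tg TPS is the aggregate across the batch, which is consistent with the reported E2E times.

```python
def batch_summary(batch_size, tg_tps, pp_tps, baseline_tg_tps):
    """Derive batching figures from aggregate throughput numbers."""
    return {
        # Aggregate generation speed relative to a single request.
        "speedup": tg_tps / baseline_tg_tps,
        # Prefill throughput each request sees when sharing the GPU.
        "pp_tps_per_req": pp_tps / batch_size,
        # Generation speed each request sees (derived, not reported).
        "tg_tps_per_req": tg_tps / batch_size,
    }


two = batch_summary(2, tg_tps=25.9, pp_tps=172.4, baseline_tg_tps=19.1)
four = batch_summary(4, tg_tps=30.3, pp_tps=172.3, baseline_tg_tps=19.1)
for label, s in (("2x", two), ("4x", four)):
    print(label, {k: round(v, 2) for k, v in s.items()})
```

Under that assumption, batching raises total throughput while each individual request generates more slowly than it would alone.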

Comparison: oQ6 vs oQ8 vs BF16

| Metric | oQ6 | oQ8 | BF16 |
| --- | --- | --- | --- |
| Size | 25 GB | 31.4 GB | 58.3 GB |
| Token Generation | 19.1 tok/s | 17.5 tok/s | 10.3 tok/s |
| Prefill | 173 tok/s | 177 tok/s | 258 tok/s |
| Peak Memory | 25.2 GB | 31.8 GB | 58.5 GB |
| Size vs. BF16 | -57% | -46% | baseline |
| Generation speed vs. BF16 | +85% faster | +70% faster | baseline |

oQ6 is the sweet spot for 32 GB Macs: it fits with room for KV cache, generates about 9% faster than oQ8, and stays close to BF16 quality. Note that prefill is slower than BF16 for both quantized variants (173 and 177 tok/s vs. 258 tok/s).
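The relative figures in this comparison can be re-derived from the absolute sizes and generation speeds. A minimal sketch, with `pct_change` as a hypothetical helper:

```python
def pct_change(value, baseline):
    """Relative change of value vs. baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


# Values copied from the comparison table.
sizes_gb = {"oQ6": 25.0, "oQ8": 31.4, "BF16": 58.3}
tg_tps = {"oQ6": 19.1, "oQ8": 17.5, "BF16": 10.3}

for name in ("oQ6", "oQ8"):
    size = pct_change(sizes_gb[name], sizes_gb["BF16"])
    speed = pct_change(tg_tps[name], tg_tps["BF16"])
    print(f"{name}: size {size:+.0f}%, generation {speed:+.0f}%")

oq6_vs_oq8 = pct_change(tg_tps["oQ6"], tg_tps["oQ8"])
print(f"oQ6 vs oQ8 generation: {oq6_vs_oq8:+.0f}%")
# oQ6: size -57%, generation +85%
# oQ8: size -46%, generation +70%
# oQ6 vs oQ8 generation: +9%
```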

Usage

oMLX

Drop the model folder into your oMLX models directory. It is auto-detected when the server starts.
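One way to get the model folder onto disk is to download the repository with `huggingface_hub`. This is a sketch under assumptions: the models directory path is a hypothetical placeholder (use whatever directory your oMLX installation is configured to scan), and `model_target_dir` and `download_model` are illustrative helpers, not part of oMLX.

```python
from pathlib import Path

REPO_ID = "mpe74/gemma-4-31B-it-oQ6"


def model_target_dir(models_dir):
    """Folder for this model inside the given models directory."""
    return Path(models_dir).expanduser() / REPO_ID.split("/")[-1]


def download_model(models_dir):
    """Download the repository into models_dir (roughly 25 GB)."""
    # Imported lazily so the path helper works without the dependency.
    from huggingface_hub import snapshot_download

    target = model_target_dir(models_dir)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target


# Hypothetical path; replace it with your actual oMLX models directory.
# download_model("~/omlx-models")
```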

mlx-lm

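from mlx_lm import load, generate

# Download (on first use) and load the quantized weights and tokenizer.
model, tokenizer = load("mpe74/gemma-4-31B-it-oQ6")

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate up to 2048 new tokens and print the completion.
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)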

Recommended Sampling Parameters

| Parameter | Value |
| --- | --- |
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
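The recommended values can be passed to mlx-lm through a sampler. The following is a sketch assuming a recent mlx-lm release in which `mlx_lm.sample_utils.make_sampler` accepts `temp`, `top_p`, and `top_k`, and `generate` accepts a `sampler` argument; older releases may differ. `generate_with_recommended_sampling` is a hypothetical wrapper.

```python
# Recommended values from the table, using make_sampler's argument names.
RECOMMENDED_SAMPLING = {"temp": 1.0, "top_p": 0.95, "top_k": 64}


def generate_with_recommended_sampling(user_prompt, max_tokens=2048):
    """Generate a reply using the recommended sampling parameters."""
    # Imported lazily so this module loads without mlx-lm installed.
    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler

    model, tokenizer = load("mpe74/gemma-4-31B-it-oQ6")
    messages = [{"role": "user", "content": user_prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Assumption: make_sampler supports top_k in the installed version.
    sampler = make_sampler(**RECOMMENDED_SAMPLING)
    return generate(
        model, tokenizer, prompt=prompt,
        max_tokens=max_tokens, sampler=sampler,
    )


# Example call (loads the full model, so it is left commented out):
# print(generate_with_recommended_sampling("Your prompt here"))
```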

LM Studio

Search for the model in LM Studio and download it. It runs on the MLX backend on Apple Silicon.

Quantization Details

| Parameter | Value |
| --- | --- |
| oQ Level | oQ6 |
| Effective bpw | ~6.5 |
| Mode | Affine quantization |
| Group size | 64 |
| Sensitivity model | Source model (google/gemma-4-31b-it BF16) |
| Calibration data | Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning) |
| Vision weights | Preserved in fp16 |
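The ~6.5 bpw figure is consistent with plain 6-bit affine quantization at group size 64. As a back-of-the-envelope sketch, assume each group of weights stores one 16-bit scale and one 16-bit bias alongside its quantized values. That storage layout is an assumption here, and the calculation ignores the fact that oQ mixes precisions across layers; both helpers are hypothetical.

```python
def effective_bpw(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight including per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(num_params, bpw):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bpw / 8 / 1e9


bpw = effective_bpw(bits=6, group_size=64)
print(bpw)  # 6.5

# Nominal 31B parameter count; the real checkpoint also keeps vision
# weights in fp16, so treat this only as a rough check.
print(round(estimated_size_gb(31e9, bpw), 1))
```

With the nominal parameter count, the estimate lands near the reported ~25 GB model size.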

Also Available


Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).
