gemma-4-31B-it-oQ6

An oQ6 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.

Produces standard MLX safetensors compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.

Key Facts

Property	Value
Base Model	google/gemma-4-31b-it (31B dense, BF16)
Quantization	oQ6 — sensitivity-driven mixed-precision
Effective bpw	~6.5
Model Size	~25 GB (vs. 58.3 GB BF16)
Vision	✅ Preserved (vision weights kept in fp16)
Format	Standard MLX safetensors
Quantized with	oMLX v0.3.4+
Hardware	Apple M2 Ultra 128 GB

Why oQ6?

oQ6 fills the gap between the existing oQ4 (~18 GB) and oQ8 (~31 GB) variants. At ~25 GB, it fits comfortably on 32 GB MacBooks with room for KV cache, while maintaining high quality with vision capabilities intact.

oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. At 6-bit, quality remains very close to BF16 and significantly better than 4-bit, while token generation is about 9% faster than oQ8 (19.1 vs. 17.5 tok/s in the comparison below).
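The allocation idea can be illustrated with a minimal sketch. This is not oMLX's actual algorithm: the layer names, sensitivity scores, candidate bit widths, and the greedy strategy in the hypothetical `allocate_bits` function are all assumptions made for illustration.

```python
def allocate_bits(sensitivity, sizes, budget_bpw, choices=(4, 6, 8)):
    """Greedy mixed-precision allocation (illustrative only, not oMLX's method).

    sensitivity: dict layer -> score (higher = more harmed by quantization)
    sizes:       dict layer -> parameter count
    budget_bpw:  target average bits per weight across all layers
    """
    choices = sorted(choices)
    # Start every layer at the lowest bit width.
    bits = {name: choices[0] for name in sensitivity}
    total_params = sum(sizes.values())
    budget_bits = budget_bpw * total_params
    used_bits = sum(bits[n] * sizes[n] for n in bits)

    # Visit layers from most to least sensitive and give each one the
    # highest bit width that still fits within the overall budget.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        for target in reversed(choices):
            extra = (target - bits[name]) * sizes[name]
            if used_bits + extra <= budget_bits:
                bits[name] = target
                used_bits += extra
                break
    return bits, used_bits / total_params


# Made-up scores and sizes, purely for demonstration.
example_sensitivity = {"attn.q": 0.9, "mlp.down": 0.6, "embed": 0.4, "mlp.up": 0.2}
example_sizes = {"attn.q": 100, "mlp.down": 300, "embed": 200, "mlp.up": 400}
example_bits, example_avg = allocate_bits(example_sensitivity, example_sizes, 6.0)
```

In this toy setup the most sensitive layers end up at 8 bits and the least sensitive at 4 bits, while the average stays at the 6-bit budget.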

Benchmarks

Tested on Apple M2 Ultra (128 GB, 76 GPU cores) with oMLX. Generation length: 128 tokens.

oQ6 (this model, 25.2 GB)

Test	TTFT (ms)	TPOT (ms)	pp TPS	tg TPS	E2E (s)	Throughput	Peak Mem
pp1024/tg128	5,918	52.8	173.0 tok/s	19.1 tok/s	12.6	91.3 tok/s	25.23 GB
pp4096/tg128	23,220	58.7	176.4 tok/s	17.2 tok/s	30.7	137.7 tok/s	27.08 GB
pp8192/tg128	46,695	69.9	175.4 tok/s	14.4 tok/s	55.6	149.7 tok/s	27.39 GB
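The columns in this table are related by simple arithmetic. The sketch below shows one set of relations that reproduces the reported numbers to rounding. The formulas are inferred from the table rather than taken from oMLX documentation, and `derive_metrics` is a hypothetical helper name.

```python
def derive_metrics(prompt_tokens, gen_tokens, ttft_ms, tpot_ms):
    """Derive throughput figures from TTFT and TPOT (assumed definitions)."""
    ttft_s = ttft_ms / 1000
    # Assumption: TPOT is averaged over the tokens after the first one,
    # since the first token's latency is already counted in TTFT.
    decode_s = (gen_tokens - 1) * tpot_ms / 1000
    e2e_s = ttft_s + decode_s
    return {
        "pp_tps": prompt_tokens / ttft_s,          # prefill tokens per second
        "tg_tps": gen_tokens / decode_s,           # generation tokens per second
        "e2e_s": e2e_s,                            # end-to-end latency in seconds
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,
    }


# First row of the table: pp1024/tg128, TTFT 5,918 ms, TPOT 52.8 ms.
row = derive_metrics(1024, 128, 5918, 52.8)
```

For the first row this gives roughly 173 tok/s prefill, 19.1 tok/s generation, 12.6 s end to end, and 91.3 tok/s overall, in line with the reported values.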

Continuous Batching (pp1024/tg128)

Batch	tg TPS	Speedup	pp TPS	pp TPS/req	Avg TTFT (ms)	E2E (s)
1x (baseline)	19.1 tok/s	1.00x	173.0 tok/s	173.0 tok/s	5,918	12.6
2x	25.9 tok/s	1.36x	172.4 tok/s	86.2 tok/s	11,669	21.7
4x	30.3 tok/s	1.59x	172.3 tok/s	43.1 tok/s	23,032	40.7
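The Speedup and pp TPS/req columns follow directly from the other columns. A short sketch, assuming speedup is aggregate generation throughput relative to the single-request baseline and per-request prefill is the aggregate divided by batch size (`batching_summary` is a hypothetical name):

```python
def batching_summary(rows):
    """rows: list of (batch_size, aggregate_tg_tps, aggregate_pp_tps)."""
    baseline_tg = rows[0][1]  # the first row is the 1x baseline
    summary = []
    for batch, tg_tps, pp_tps in rows:
        summary.append({
            "batch": batch,
            "speedup": tg_tps / baseline_tg,      # aggregate gain over 1x
            "pp_tps_per_req": pp_tps / batch,     # prefill share per request
            "tg_tps_per_req": tg_tps / batch,     # generation rate seen by each request
        })
    return summary


# Values copied from the table above.
batch_rows = [(1, 19.1, 173.0), (2, 25.9, 172.4), (4, 30.3, 172.3)]
batch_summary = batching_summary(batch_rows)
```

The per-request generation rate computed here is not in the table; it shows that aggregate throughput rises with batch size while each individual request decodes more slowly.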

Comparison: oQ6 vs oQ8 vs BF16

Metric	oQ6	oQ8	BF16
Size	25 GB	31.4 GB	58.3 GB
Token Generation	19.1 tok/s	17.5 tok/s	10.3 tok/s
Prefill	173 tok/s	177 tok/s	258 tok/s
Peak Memory	25.2 GB	31.8 GB	58.5 GB
vs BF16 size	-57%	-46%	baseline
vs BF16 generation speed	+85%	+70%	baseline

oQ6 is the sweet spot for 32 GB Macs: it fits with room for KV cache, generates about 9% faster than oQ8, and stays close to oQ8's near-lossless quality.
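The relative figures in the comparison can be recomputed from the raw rows. A minimal sketch, with the numbers copied from the table and `pct_change` as a hypothetical helper:

```python
def pct_change(value, baseline):
    """Signed percentage change of value relative to baseline."""
    return (value / baseline - 1) * 100


# Rows copied from the comparison table.
size_gb = {"oQ6": 25.0, "oQ8": 31.4, "BF16": 58.3}
tg_tps = {"oQ6": 19.1, "oQ8": 17.5, "BF16": 10.3}
pp_tps = {"oQ6": 173.0, "oQ8": 177.0, "BF16": 258.0}

size_vs_bf16 = {k: pct_change(v, size_gb["BF16"]) for k, v in size_gb.items()}
tg_vs_bf16 = {k: pct_change(v, tg_tps["BF16"]) for k, v in tg_tps.items()}
pp_vs_bf16 = {k: pct_change(v, pp_tps["BF16"]) for k, v in pp_tps.items()}
oq6_vs_oq8_tg = pct_change(tg_tps["oQ6"], tg_tps["oQ8"])
```

The speed gains apply to token generation only. On the same figures, prefill for both quantized variants is slower than BF16 (173 and 177 tok/s vs. 258 tok/s).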

Usage

oMLX

Drop the model folder into your oMLX models directory; it is auto-detected on server start.

mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mpe74/gemma-4-31B-it-oQ6")
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

Recommended Sampling Parameters

Parameter	Value
temperature	1.0
top_p	0.95
top_k	64

LM Studio

Search for the model and download it. It runs on the MLX backend on Apple Silicon.

Quantization Details

Parameter	Value
oQ Level	oQ6
Effective bpw	~6.5
Mode	Affine quantization
Group size	64
Sensitivity model	Source model (google/gemma-4-31b-it BF16)
Calibration data	Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning)
Vision weights	Preserved in fp16
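To show what affine quantization with a group size means in practice, here is a minimal plain-Python sketch. It assumes each group of weights stores integer codes plus one scale and one bias, both held in 16 bits; the function names are hypothetical, and real MLX kernels pack the codes and operate on arrays rather than Python lists.

```python
def quantize_group(values, bits=6):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1             # highest code, e.g. 63 for 6-bit
    scale = (hi - lo) / levels or 1.0    # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo              # the group minimum serves as the bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values from codes, scale, and bias."""
    return [scale * q + bias for q in codes]


def effective_bpw(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight including per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


# One group of 64 evenly spaced toy weights.
group = [i / 10 for i in range(-32, 32)]
codes, scale, bias = quantize_group(group, bits=6)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Under these assumptions, a uniform 6-bit layout with group size 64 comes to 6 + 32/64 = 6.5 bits per weight, which is consistent with the ~6.5 effective bpw listed above. The actual per-layer mix chosen by oQ, and the vision weights kept in fp16, may shift the real figure.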

Also Available

mpe74/gemma-4-31B-it-oQ8 — Near-lossless 8-bit (~31 GB) for 64+ GB Macs

Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).
