Instructions for using ebircak/gemma-4-31B-it-GGUF with libraries and local apps.

Libraries

How to use ebircak/gemma-4-31B-it-GGUF with llama-cpp-python:

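# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="ebircak/gemma-4-31B-it-GGUF",
    filename="gemma-4-31B-it-IQ4_NL_L_AMD.gguf",
)

# Send a single-turn chat request and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])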


llama.cpp

How to use ebircak/gemma-4-31B-it-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
# Run inference directly in the terminal:
llama-cli -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
# Run inference directly in the terminal:
llama-cli -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
# Run inference directly in the terminal:
./llama-cli -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Use Docker

docker model run hf.co/ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L


vLLM

How to use ebircak/gemma-4-31B-it-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ebircak/gemma-4-31B-it-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ebircak/gemma-4-31B-it-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Ollama
How to use ebircak/gemma-4-31B-it-GGUF with Ollama:
```
ollama run hf.co/ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
```

Unsloth Studio

How to use ebircak/gemma-4-31B-it-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ebircak/gemma-4-31B-it-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ebircak/gemma-4-31B-it-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ebircak/gemma-4-31B-it-GGUF to start chatting

Pi

How to use ebircak/gemma-4-31B-it-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L"
        }
      ]
    }
  }
}
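
If `models.json` already lists other providers, editing it by hand risks overwriting them. The sketch below merges the provider entry shown above into an existing file instead. It assumes only the file location and JSON layout given in this section; `add_provider` is a hypothetical helper, not part of Pi.

```python
import json
from pathlib import Path

# Provider entry copied from the JSON snippet above.
PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L"}],
}


def add_provider(config_path: Path, name: str = "llama-cpp") -> dict:
    """Merge PROVIDER into the JSON file at config_path, keeping other providers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("providers", {})[name] = PROVIDER
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_provider(Path.home() / ".pi" / "agent" / "models.json")` would write the entry to the path named above.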

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ebircak/gemma-4-31B-it-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Run Hermes

hermes

Docker Model Runner
How to use ebircak/gemma-4-31B-it-GGUF with Docker Model Runner:
```
docker model run hf.co/ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L
```

Lemonade

How to use ebircak/gemma-4-31B-it-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ebircak/gemma-4-31B-it-GGUF:IQ4_NL_L

Run and chat with the model

lemonade run user.gemma-4-31B-it-GGUF-IQ4_NL_L

List all available models

lemonade list

Gemma-4-31B-it GGUF Quantization for AMD RDNA3 (gfx1100)

This is a GGUF quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for AMD Radeon RX 7900 XTX (gfx1100 / RDNA3) GPUs using llama.cpp.

Model Details

Property	Value
Base Model	google/gemma-4-31B-it
Quantization Format	GGUF V3 (latest)
Quantization Type	multiple
Architecture	Gemma4ForConditionalGeneration
Layers	60 decoder layers
Hidden Size	5376
Context Window	262K tokens
Vision Tower	SigLIP (27 layers, F16/Q8_0 mmproj)
Quantization Library	llama.cpp b8703
ROCm Version	6.4.4
Hardware Target	AMD Radeon RX 7900 XTX (gfx1100, 24GB VRAM)

Available Variants

This repository contains quantization variants optimized for different performance/quality tradeoffs:

Variant L (Quality-Optimized) — `gemma-4-31B-it-IQ4_NL_L_AMD.gguf`

Metric	Value
File Size	~20.11 GB
bpw	~4.5 bits per weight
KLD vs F16	0.647 (lower is better)
PPL (wikitext-2)	~12565
Quant Strategy	IQ4_NL

Best for: Highest-quality output when about 20 GB of VRAM can go to the weights, either on a single 24 GB GPU with a reduced context or on a 2-GPU setup.

Variant M (Size-Optimized) — `gemma-4-31B-it-IQ4_NL_M_AMD.gguf`

Metric	Value
File Size	~18.70 GB
bpw	~4.25 bits per weight
KLD vs F16	0.684 (lower is better)
PPL (wikitext-2)	~12197
Quant Strategy	IQ4_NL

Best for: Single 7900 XTX (24GB) with headroom for KV cache.

Vision Projector Files

File	Size	Description
`mmproj-gemma-4-31B-it-f16.gguf`	~1.12 GB	Vision encoder projector (F16)
`mmproj-gemma-4-31B-it-q8_0.gguf`	~0.75 GB	Vision encoder projector (Q8_0)

Performance Benchmarks

2x AMD Radeon RX 7900 XTX (48GB Total VRAM)

Variant L (20.11 GB) — llama.cpp-b8703, -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 256

Test	Tokens/sec
pp1024	1381.37 ± 2.36
pp2048	1494.88 ± 1.42
pp4096	1530.94 ± 0.97
pp8192	1493.17 ± 0.76
pp16384	1377.77 ± 0.66
pp32768	1177.68 ± 0.54
tg128	24.14 ± 0.01
tg512	23.82 ± 0.01
tg1024	23.49 ± 0.00

Variant M (18.68 GB) — Same configuration

Test	Tokens/sec
pp1024	1361.17 ± 1.42
pp2048	1473.53 ± 1.80
pp4096	1510.83 ± 1.21
pp8192	1476.50 ± 0.61
pp16384	1365.21 ± 0.38
pp32768	1169.50 ± 0.16
tg128	25.05 ± 0.01
tg512	24.66 ± 0.02
tg1024	24.30 ± 0.01

Single AMD Radeon RX 7900 XTX (24GB VRAM)

Variant L (20.11 GB) — -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024

Test	Tokens/sec
pp1024	962.09 ± 0.75
pp2048	925.67 ± 0.24
pp4096	886.89 ± 0.39
pp8192	834.71 ± 0.27
pp16384	752.85 ± 0.22
pp32768	OOM crash
tg128	26.73 ± 0.02
tg512	26.33 ± 0.01
tg1024	25.94 ± 0.00

Variant M (18.68 GB) — Same configuration

Test	Tokens/sec
pp1024	955.37 ± 0.72
pp2048	919.04 ± 0.27
pp4096	880.58 ± 0.30
pp8192	829.50 ± 0.12
pp16384	749.10 ± 0.16
pp32768	629.07 ± 0.17
tg128	27.98 ± 0.01
tg512	27.53 ± 0.01
tg1024	27.08 ± 0.00
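
The pp (prompt processing) and tg (token generation) rates can be combined into a rough end-to-end latency estimate. The sketch below is illustrative only: `estimate_latency_s` is a hypothetical helper, it assumes the two phases run back to back at the benchmarked rates, and it ignores model load time and sampling overhead.

```python
def estimate_latency_s(prompt_tokens: int, gen_tokens: int,
                       pp_tps: float, tg_tps: float) -> float:
    """Rough request latency: prompt time plus generation time, in seconds."""
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Single 7900 XTX, Variant M: pp8192 = 829.50 t/s, tg512 = 27.53 t/s (from the table above).
latency = estimate_latency_s(8192, 512, pp_tps=829.50, tg_tps=27.53)
print(f"~{latency:.1f} s for an 8192-token prompt and a 512-token reply")
```

With these numbers, most of the time goes to generation rather than prompt processing, so the tg rate matters more for chat-style workloads.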

Quality Metrics

Evaluated on wikitext-2-raw with F16 baseline:

Variant	PPL	KLD vs F16	Notes
F16 Baseline	12250.49	0.000	Reference
Variant L	~12565	0.647	K-quants for attn_k/output
Variant M	~12197	0.684	Pure IQ4_NL

KLD (KL Divergence) measures output distribution divergence from F16 baseline. Lower is better.
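
For readers unfamiliar with these metrics, the sketch below shows how perplexity and KL divergence are defined for next-token distributions. It is a toy illustration on hand-written probabilities using only the standard library, not the evaluation code behind the table above; the function names and sample numbers are made up for the example.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats, with P the reference (F16) and Q the quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 3-token vocabulary (illustrative values).
f16_probs = [0.7, 0.2, 0.1]
quant_probs = [0.6, 0.25, 0.15]

print(kl_divergence(f16_probs, f16_probs))    # identical distributions: 0.0
print(kl_divergence(f16_probs, quant_probs))  # small positive value
print(perplexity([0.5, 0.25, 0.125]))
```

In an evaluation run, the per-token KL values would be averaged over every position in the test text to give a single figure like the ones reported above.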

Usage with llama.cpp

Basic Chat (Single GPU)

# Load the model with a 32K-token context (-c 32768) on a single 7900 XTX
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 1024 -t 5

Multi-GPU (2x 7900 XTX)

# Split across 2 GPUs with smaller batch for higher throughput
./llama-server -m gemma-4-31B-it-IQ4_NL_L_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 256 -b 1024 -t 5 \
    -ts 1,1

API Server

# Start OpenAI-compatible API
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 1024 -t 5 \
    --port 8080
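
Once the server is running, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server listens on localhost port 8080 as configured above and exposes the `/v1/chat/completions` route; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 above; adjust if needed


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        # Assumption: the server answers with the model it was started with,
        # so this name is only a label.
        "model": "gemma-4-31B-it-IQ4_NL_M_AMD",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` should print the model's answer.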

Hardware Requirements

Variant	Minimum VRAM	Recommended
Variant L (20GB)	24GB (single GPU, limited context)	48GB (2x 7900 XTX)
Variant M (18GB)	24GB (single 7900 XTX)	48GB (2x 7900 XTX)

Note: Both variants were tested on AMD ROCm 6.4.4 with llama.cpp b8703+. NVIDIA GPUs will also work, but they were not the target hardware.
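
How much context fits next to the weights depends largely on the KV cache. The sketch below gives a rough upper bound for a quantized KV cache, assuming every layer uses full attention over the whole context. The layer count and context length come from this card, but `n_kv_heads` and `head_dim` are placeholder values chosen for illustration, not the real model parameters; read the actual values from the GGUF metadata before relying on the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bits_per_element: float) -> int:
    """Upper-bound size of a K+V cache with full attention in every layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys and values
    return int(elements * bits_per_element / 8)


Q4_1_BITS = 5.0  # q4_1 stores 32 values in 20 bytes (16 bytes of nibbles + two fp16)

# 60 layers and -c 32768 come from this card; n_kv_heads and head_dim are
# PLACEHOLDERS, not the real model values.
cache = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                       n_ctx=32768, bits_per_element=Q4_1_BITS)
print(f"KV cache upper bound: {cache / 2**30:.2f} GiB")
```

Models that restrict some layers to a sliding attention window need less than this bound, and compute buffers add to the total, so treat the figure as a starting point for sizing rather than a measurement.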

Quantization Method

Variant L (IQ4_NL_L) — Tensor Composition

Tensor Group	Quantization	Layers
`attn_k` + `attn_output`	IQ4_NL	All 60 layers
`attn_q`	Q8_0	Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (30 layers)
`attn_q`	IQ4_NL	Remaining layers
`attn_v`	Q8_0	Layers 0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58 (25 layers)
`attn_v`	IQ4_NL	Remaining layers
`ffn_down`	Q8_0	Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (30 layers)
`ffn_down`	IQ4_NL	Remaining layers
`ffn_gate` + `ffn_up`	Q4_1	All 60 layers
`token_embd`	Q8_0	-
`output_norm`	F32	-

Variant M (Pure IQ4_NL) — Tensor Composition

Tensor Group	Quantization	Layers
`attn_k` + `attn_output`	IQ4_NL	All 60 layers
`attn_q`	Q8_0	Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (27 layers)
`attn_q`	IQ4_NL	Remaining layers
`attn_v`	Q8_0	Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (27 layers)
`attn_v`	IQ4_NL	Remaining layers
`ffn_down`	Q4_1	All 60 layers (pure IQ4_NL base)
`ffn_gate` + `ffn_up`	Q4_1	All 60 layers
`token_embd`	Q8_0	-
`output_norm`	F32	-
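
The layer lists in the tables above use a compact range notation. A small helper such as the following (a hypothetical utility, not part of llama.cpp) expands a spec into explicit layer indices, which is handy for checking a list or for working out which layers fall under "Remaining layers".

```python
def expand_layers(spec: str) -> list[int]:
    """Expand a spec such as '0-6, 9, 12, 51-59' into a sorted list of layer indices."""
    layers: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            layers.update(range(start, end + 1))
        else:
            layers.add(int(part))
    return sorted(layers)


# Variant M attn_q layers kept at Q8_0, copied from the table above.
q8_layers = expand_layers("0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59")
iq4_layers = [i for i in range(60) if i not in q8_layers]  # the "remaining layers"
print(len(q8_layers), len(iq4_layers))
```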

Calibration

Dataset: Mixed task-specific calibration (Magicoder + APIGen-MT + When2Call)
Method: imatrix with task-formatted data
Group Size: 128
imatrix entries: 410
imatrix chunks: 100

Files in This Repository

File	Description
`gemma-4-31B-it-IQ4_NL_L_AMD.gguf`	Variant L (20GB, K-quants for attn)
`gemma-4-31B-it-IQ4_NL_M_AMD.gguf`	Variant M (18GB, pure IQ4_NL)
`mmproj-gemma-4-31B-it-f16.gguf`	Vision projector (F16)
`mmproj-gemma-4-31B-it-q8_0.gguf`	Vision projector (Q8_0)
`imatrix.dat`	Calibration data (optional)

License

This quantization is released under the Apache 2.0 License.

The base model google/gemma-4-31B-it is also licensed under Apache 2.0.

See LICENSE for the full text.

Citation

If you use this model in your research, please cite:

@misc{gemma4-31b-gguf-amd,
  title = {Gemma-4-31B-it GGUF Quantization for AMD RDNA3},
  author = {ebircak},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-GGUF}},
  note = {Quantized with llama.cpp b8703 for AMD gfx1100}
}

Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model optimized for AMD RDNA3 GPUs. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

These builds are tuned for the AMD Radeon RX 7900 XTX (gfx1100) under ROCm. They run unmodified on other AMD GPUs and on NVIDIA GPUs, but other GGUF quantizations may offer a better size/KLD trade-off on that hardware.

Model Card Version

This model card follows the Model Cards for Model Reporting standard.

Original Model: google/gemma-4-31B-it
Quantization Tool: llama.cpp
Quantization Format: GGUF V3
Target Hardware: AMD Radeon RX 7900 XTX (gfx1100, RDNA3)

Paper for ebircak/gemma-4-31B-it-GGUF

Model Cards for Model Reporting (arXiv:1810.03993, published Oct 5, 2018)