Gemma-4-31B-it GGUF Quantization for AMD RDNA3 (gfx1100)
This is a GGUF quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for AMD Radeon RX 7900 XTX (gfx1100 / RDNA3) GPUs using llama.cpp.
Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Format | GGUF V3 (latest) |
| Quantization Type | multiple |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, F16/Q8_0 mmproj) |
| Quantization Library | llama.cpp b8703 |
| ROCm Version | 6.4.4 |
| Hardware Target | AMD Radeon RX 7900 XTX (gfx1100, 24GB VRAM) |
Available Variants
This repository contains quantization variants optimized for different performance/quality tradeoffs:
4-bit Variant L (Quality-Optimized) — gemma-4-31B-it-IQ4_NL_L_AMD.gguf
| Metric | Value |
|---|---|
| File Size | ~20.11 GB |
| bpw | ~4.5 bits per weight |
| KLD vs F16 | 0.647 (lower is better) |
| PPL (wikitext-2) | ~12565 |
| Quant Strategy | IQ4_NL |
Best for: maximum quality where VRAM allows (~20GB of weights), either on a single 24GB GPU with a smaller context or on a 2-GPU setup.
Variant M (Size-Optimized) — gemma-4-31B-it-IQ4_NL_M_AMD.gguf
| Metric | Value |
|---|---|
| File Size | ~18.70 GB |
| bpw | ~4.25 bits per weight |
| KLD vs F16 | 0.684 (lower is better) |
| PPL (wikitext-2) | ~12197 |
| Quant Strategy | IQ4_NL |
Best for: Single 7900 XTX (24GB) with headroom for KV cache.
Vision Projector Files
| File | Size | Description |
|---|---|---|
| mmproj-gemma-4-31B-it-f16.gguf | ~1.12 GB | Vision encoder projector (F16) |
| mmproj-gemma-4-31B-it-q8_0.gguf | ~0.75 GB | Vision encoder projector (Q8_0) |
Performance Benchmarks
2x AMD Radeon RX 7900 XTX (48GB Total VRAM)
Variant L (20.11 GB) — llama.cpp-b8703, -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 256
| Test | Tokens/sec |
|---|---|
| pp1024 | 1381.37 ± 2.36 |
| pp2048 | 1494.88 ± 1.42 |
| pp4096 | 1530.94 ± 0.97 |
| pp8192 | 1493.17 ± 0.76 |
| pp16384 | 1377.77 ± 0.66 |
| pp32768 | 1177.68 ± 0.54 |
| tg128 | 24.14 ± 0.01 |
| tg512 | 23.82 ± 0.01 |
| tg1024 | 23.49 ± 0.00 |
Variant M (18.68 GB) — Same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 1361.17 ± 1.42 |
| pp2048 | 1473.53 ± 1.80 |
| pp4096 | 1510.83 ± 1.21 |
| pp8192 | 1476.50 ± 0.61 |
| pp16384 | 1365.21 ± 0.38 |
| pp32768 | 1169.50 ± 0.16 |
| tg128 | 25.05 ± 0.01 |
| tg512 | 24.66 ± 0.02 |
| tg1024 | 24.30 ± 0.01 |
Single AMD Radeon RX 7900 XTX (24GB VRAM)
Variant L (20.11 GB) — -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024
| Test | Tokens/sec |
|---|---|
| pp1024 | 962.09 ± 0.75 |
| pp2048 | 925.67 ± 0.24 |
| pp4096 | 886.89 ± 0.39 |
| pp8192 | 834.71 ± 0.27 |
| pp16384 | 752.85 ± 0.22 |
| pp32768 | OOM crash |
| tg128 | 26.73 ± 0.02 |
| tg512 | 26.33 ± 0.01 |
| tg1024 | 25.94 ± 0.00 |
Variant M (18.68 GB) — Same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 955.37 ± 0.72 |
| pp2048 | 919.04 ± 0.27 |
| pp4096 | 880.58 ± 0.30 |
| pp8192 | 829.50 ± 0.12 |
| pp16384 | 749.10 ± 0.16 |
| pp32768 | 629.07 ± 0.17 |
| tg128 | 27.98 ± 0.01 |
| tg512 | 27.53 ± 0.01 |
| tg1024 | 27.08 ± 0.00 |
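To read these tables as end-to-end latency, a rough estimate is prompt tokens divided by the pp rate plus generated tokens divided by the tg rate. The sketch below is only that arithmetic; the function name is illustrative, the rates are copied from the Variant M rows above, and it ignores tokenization, sampling, and HTTP overhead.

```python
def estimate_latency_s(prompt_tokens: int, gen_tokens: int,
                       pp_rate: float, tg_rate: float) -> float:
    """Seconds to prefill a prompt and generate a reply, given tokens/sec rates."""
    if pp_rate <= 0 or tg_rate <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / pp_rate + gen_tokens / tg_rate


# Rates from the single-GPU Variant M table (pp8192 and tg512 rows).
single_gpu_m = estimate_latency_s(8192, 512, pp_rate=829.50, tg_rate=27.53)

# Rates from the 2-GPU Variant M table (same rows).
dual_gpu_m = estimate_latency_s(8192, 512, pp_rate=1476.50, tg_rate=24.66)

print(f"single GPU: {single_gpu_m:.1f} s, dual GPU: {dual_gpu_m:.1f} s")
```

In these tables the second GPU mainly shortens prompt processing; token generation is slightly slower than on a single card, so the benefit grows with prompt length.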
Quality Metrics
Evaluated on wikitext-2-raw with F16 baseline:
| Variant | PPL | KLD vs F16 | Notes |
|---|---|---|---|
| F16 Baseline | 12250.49 | 0.000 | Reference |
| Variant L | ~12565 | 0.647 | K-quants for attn_k/output |
| Variant M | ~12197 | 0.684 | Pure IQ4_NL |
KLD (Kullback-Leibler divergence) measures how far the quantized model's output token distribution diverges from the F16 baseline. Lower is better.
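As an illustration of the two metrics in the table, the sketch below computes KL divergence for a single token position and perplexity from a list of token log-probabilities. The logits are made-up toy values, not model output, and the helper names are illustrative; evaluation tools usually average the per-position KLD over the whole test set.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats, with P the reference (F16) and Q the quantized model."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy logits over a 4-token vocabulary (hypothetical values).
ref_logits = [2.0, 1.0, 0.1, -1.0]
quant_logits = [1.8, 1.2, 0.0, -0.7]

kld = kl_divergence(softmax(ref_logits), softmax(quant_logits))
print(f"KLD for one position: {kld:.4f} nats")
```

Perplexity only scores the probability given to the correct token, while KLD compares the full distributions, which is why the two columns can rank variants differently.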
Usage with llama.cpp
Basic Chat (Single GPU)
# Load model with a 32K-token context (-c 32768) on a single 7900 XTX
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
--mmproj mmproj-gemma-4-31B-it-f16.gguf \
-ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
-c 32768 -ub 1024 -t 5
Multi-GPU (2x 7900 XTX)
# Split across 2 GPUs with smaller batch for higher throughput
./llama-server -m gemma-4-31B-it-IQ4_NL_L_AMD.gguf \
--mmproj mmproj-gemma-4-31B-it-f16.gguf \
-ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
-c 32768 -ub 256 -b 1024 -t 5 \
-ts 1,1
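The `-ts` (tensor split) value is a list of proportions, so `1,1` asks for an even split. A minimal sketch of that proportion arithmetic, assuming the weights divide exactly by ratio (real placement happens per layer, and KV cache and compute buffers add to each GPU's total); `split_shares` is an illustrative helper, not part of llama.cpp:

```python
def split_shares(tensor_split: str, total_gb: float) -> list[float]:
    """Turn a '-ts' style ratio string into approximate per-GPU sizes in GB."""
    ratios = [float(part) for part in tensor_split.split(",")]
    if any(r < 0 for r in ratios) or sum(ratios) == 0:
        raise ValueError("ratios must be non-negative and not all zero")
    total = sum(ratios)
    return [total_gb * r / total for r in ratios]


# Variant L file size from this card, split evenly across two GPUs.
print(split_shares("1,1", 20.11))

# An uneven split, e.g. to leave room on the GPU that also drives a display.
print(split_shares("3,2", 20.11))
```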
API Server
# Start OpenAI-compatible API
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
--mmproj mmproj-gemma-4-31B-it-f16.gguf \
-ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
-c 32768 -ub 1024 -t 5 \
--port 8080
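Once the server is running, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch, assuming the server listens on `localhost:8080` as in the command above and that your llama.cpp build accepts `image_url` content parts when an mmproj file is loaded. The function names are illustrative.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes --port 8080 on the same machine


def build_chat_request(prompt, image_bytes=None, image_mime="image/png", max_tokens=256):
    """Build an OpenAI-style chat payload, optionally attaching one image."""
    if image_bytes is None:
        content = prompt
    else:
        encoded = base64.b64encode(image_bytes).decode("ascii")
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{image_mime};base64,{encoded}"}},
        ]
    return {"messages": [{"role": "user", "content": content}], "max_tokens": max_tokens}


def chat(payload, base_url=BASE_URL, timeout=300):
    """POST the payload to /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat(build_chat_request("What is the capital of France?")))
#   with open("photo.png", "rb") as f:  # hypothetical file name
#       print(chat(build_chat_request("Describe this image.", f.read())))
```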
Hardware Requirements
| Variant | Minimum VRAM | Recommended |
|---|---|---|
| Variant L (20GB) | 24GB (single GPU, limited context) | 48GB (2x 7900 XTX) |
| Variant M (18GB) | 24GB (single 7900 XTX) | 48GB (2x 7900 XTX) |
Note: Both variants were tested on AMD ROCm 6.4.4 with llama.cpp b8703+. NVIDIA GPUs should work as well, but they were not the target of this work.
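How much context fits next to the weights depends mostly on the KV cache. The sketch below gives an upper-bound estimate, assuming every layer caches the full context. Only `n_layers=60` comes from this card; `n_kv_heads` and `head_dim` are hypothetical placeholders that should be replaced with the values in the GGUF metadata. Models that use sliding-window attention on some layers need less than this bound.

```python
# Bits per element for common llama.cpp cache types (block payload plus scales).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5, "q4_1": 5.0}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="q4_1"):
    """Upper-bound KV cache size in GiB if every layer caches n_ctx tokens."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


# n_kv_heads=8 and head_dim=128 are HYPOTHETICAL; check the model metadata.
for cache_type in ("f16", "q4_1"):
    size = kv_cache_gib(n_ctx=32768, n_layers=60, n_kv_heads=8, head_dim=128,
                        cache_type=cache_type)
    print(f"{cache_type}: {size:.2f} GiB")
```

Comparing the `f16` and `q4_1` lines shows why the commands in this card pass `-ctk q4_1 -ctv q4_1`: the quantized cache takes under a third of the F16 size.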
Quantization Method
Variant L (IQ4_NL_L) — Tensor Composition
| Tensor Group | Quantization | Layers |
|---|---|---|
| attn_k + attn_output | IQ4_NL | All 60 layers |
| attn_q | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| attn_q | IQ4_NL | Remaining layers |
| attn_v | Q8_0 | Layers 0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58 (23 layers) |
| attn_v | IQ4_NL | Remaining layers |
| ffn_down | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| ffn_down | IQ4_NL | Remaining layers |
| ffn_gate + ffn_up | Q4_1 | All 60 layers |
| token_embd | Q8_0 | - |
| output_norm | F32 | - |
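The Variant L table can be read as a per-tensor lookup: some tensor groups use one type everywhere, others switch to Q8_0 on listed layers. The sketch below encodes the layer lists exactly as written in the table; it is one reading of that table, not the author's quantization script, and the names `expand` and `quant_for` are illustrative.

```python
def expand(spec: str) -> set[int]:
    """Expand a list like '0-6, 9, 12' into {0, 1, ..., 6, 9, 12}."""
    layers: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            layers.update(range(int(lo), int(hi) + 1))
        else:
            layers.add(int(part))
    return layers


# Layers that get Q8_0 in Variant L, transcribed from the table.
Q8_LAYERS_L = {
    "attn_q": expand("0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59"),
    "attn_v": expand("0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58"),
    "ffn_down": expand("0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59"),
}

# Tensor groups that use a single type on all 60 layers.
UNIFORM_L = {"attn_k": "IQ4_NL", "attn_output": "IQ4_NL", "ffn_gate": "Q4_1", "ffn_up": "Q4_1"}


def quant_for(tensor: str, layer: int) -> str:
    """Return the Variant L quantization type for a tensor in a given layer."""
    if tensor in UNIFORM_L:
        return UNIFORM_L[tensor]
    if tensor in Q8_LAYERS_L:
        return "Q8_0" if layer in Q8_LAYERS_L[tensor] else "IQ4_NL"
    raise KeyError(f"no rule for tensor {tensor!r}")


print(quant_for("attn_q", 9), quant_for("attn_q", 8), quant_for("ffn_up", 30))
```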
Variant M (Pure IQ4_NL) — Tensor Composition
| Tensor Group | Quantization | Layers |
|---|---|---|
| attn_k + attn_output | IQ4_NL | All 60 layers |
| attn_q | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| attn_q | IQ4_NL | Remaining layers |
| attn_v | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| attn_v | IQ4_NL | Remaining layers |
| ffn_down | Q4_1 | All 60 layers (pure IQ4_NL base) |
| ffn_gate + ffn_up | Q4_1 | All 60 layers |
| token_embd | Q8_0 | - |
| output_norm | F32 | - |
Calibration
- Dataset: Mixed task-specific calibration (Magicoder + APIGen-MT + When2Call)
- Method: imatrix with task-formatted data
- Group Size: 128
- imatrix entries: 410
- imatrix chunks: 100
Files in This Repository
| File | Description |
|---|---|
| gemma-4-31B-it-IQ4_NL_L_AMD.gguf | Variant L (20GB, K-quants for attn) |
| gemma-4-31B-it-IQ4_NL_M_AMD.gguf | Variant M (18GB, pure IQ4_NL) |
| mmproj-gemma-4-31B-it-f16.gguf | Vision projector (F16) |
| mmproj-gemma-4-31B-it-q8_0.gguf | Vision projector (Q8_0) |
| imatrix.dat | Calibration data (optional) |
License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
Citation
If you use this model in your research, please cite:
@misc{gemma4-31b-gguf-amd,
title = {Gemma-4-31B-it GGUF Quantization for AMD RDNA3},
author = {ebircak},
year = {2026},
howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-GGUF}},
note = {Quantized with llama.cpp b8703 for AMD gfx1100}
}
Disclaimer
This is a community quantization of the Google Gemma-4-31B-it model optimized for AMD RDNA3 GPUs. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.
These builds are tuned for the AMD Radeon RX 7900 XTX (gfx1100) under ROCm. They run without modification on other AMD GPUs and on NVIDIA GPUs, but other GGUF quantizations may offer a better size/KLD ratio on that hardware.
Model Card Version
This model card follows the Model Cards for Model Reporting standard.
Original Model: google/gemma-4-31B-it
Quantization Tool: llama.cpp
Quantization Format: GGUF V3
Target Hardware: AMD Radeon RX 7900 XTX (gfx1100, RDNA3)