Instructions for using RthItalia/nano_compact_3b_qkvfp16 with libraries and local apps.

Libraries

How to use RthItalia/nano_compact_3b_qkvfp16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RthItalia/nano_compact_3b_qkvfp16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RthItalia/nano_compact_3b_qkvfp16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("RthItalia/nano_compact_3b_qkvfp16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RthItalia/nano_compact_3b_qkvfp16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RthItalia/nano_compact_3b_qkvfp16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/nano_compact_3b_qkvfp16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
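
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the response shape is assumed to follow the OpenAI chat completions format that the server advertises.

```python
import json
import urllib.request

MODEL_ID = "RthItalia/nano_compact_3b_qkvfp16"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the first choice's text.

    Requires a server that is already running at base_url.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response with choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example (needs a live server):
# print(ask("What is the capital of France?"))
```

The same helper should work against the SGLang server described further down by passing `base_url="http://localhost:30000"`.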

Use Docker

docker model run hf.co/RthItalia/nano_compact_3b_qkvfp16

SGLang

How to use RthItalia/nano_compact_3b_qkvfp16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RthItalia/nano_compact_3b_qkvfp16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/nano_compact_3b_qkvfp16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RthItalia/nano_compact_3b_qkvfp16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RthItalia/nano_compact_3b_qkvfp16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Nano Compact 3B QKV-FP16

RthItalia/nano_compact_3b_qkvfp16 is the validated, compact, self-contained variant derived from Qwen/Qwen2.5-3B-Instruct.

This release is not the old overlay artifact zip. It is the final exported Hugging Face folder, which loads directly with transformers when trust_remote_code=True is set.

What This Variant Is

This runtime uses a mixed policy:

q_proj, k_proj, v_proj: fp16
o_proj and most of the remaining body: Nano compact format
model.embed_tokens: quantized single copy
lm_head: tied custom head over the quantized embeddings
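
As a rough illustration of the mixed policy listed above, the mapping from module name to storage format could be written as follows. This is a sketch, not the repository's code: the function name `storage_policy` and the returned labels are hypothetical, and the fallback branch oversimplifies "most of the remaining body" by treating every other module as Nano compact.

```python
# Attention projections that the policy keeps in fp16.
FP16_PROJECTIONS = ("q_proj", "k_proj", "v_proj")


def storage_policy(module_name: str) -> str:
    """Return a label for how a module is stored under the mixed policy.

    Labels are illustrative only; the real runtime may use other names
    and may exempt further modules from the compact format.
    """
    parts = module_name.split(".")
    if any(part in FP16_PROJECTIONS for part in parts):
        return "fp16"
    if "embed_tokens" in parts:
        return "quantized-embedding"
    if "lm_head" in parts:
        return "tied-head"
    # Assumption: everything else (o_proj, MLP projections, ...) is compact.
    return "nano-compact"


for name in (
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.embed_tokens",
    "lm_head",
):
    print(f"{name}: {storage_policy(name)}")
```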

The goal of this policy is a practical balance between:

disk size
VRAM footprint
quality relative to the true 8-bit baseline

Validated Runtime Envelope

model size: 2.3432 GB
allocated after load: 2.3432 GB
peak generation VRAM: about 2.44 GB

True 8-bit baseline used for comparison:

allocated after load: 3.1703 GB
peak generation VRAM: about 3.21 GB
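
The memory advantage implied by the two sets of figures above can be worked out directly. The short sketch below only does arithmetic on the reported numbers; the helper name `savings` is hypothetical.

```python
# Figures copied from the lists above, in GB.
NANO = {"load": 2.3432, "peak": 2.44}
BASELINE_8BIT = {"load": 3.1703, "peak": 3.21}


def savings(nano: float, baseline: float) -> tuple[float, float]:
    """Return (GB saved, percent saved relative to the baseline)."""
    saved = baseline - nano
    return saved, 100.0 * saved / baseline


for key in ("load", "peak"):
    saved_gb, saved_pct = savings(NANO[key], BASELINE_8BIT[key])
    print(f"{key}: {saved_gb:.2f} GB less ({saved_pct:.1f}% below the 8-bit baseline)")
```

On the reported numbers this comes to about 0.83 GB (roughly 26%) less allocated memory after load and about 0.77 GB (roughly 24%) less at peak generation.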

Quality Claim

The quality claim for this release is intentionally narrow:

it was compared against the true 8-bit baseline on a small internal smoke suite of 4 prompts
it is not claimed to match the full original model in every task
it is not claimed to outperform the base model

More aggressive variants reached better size or VRAM numbers but failed the quality gate against the true 8-bit reference.
qkvfp16 was the first variant that restored acceptable behavior on the validation smoke suite while preserving a substantial memory advantage: about 0.83 GB less allocated after load than the 8-bit baseline.

How To Load

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "RthItalia/nano_compact_3b_qkvfp16"

tok = AutoTokenizer.from_pretrained(
    repo_id,
    use_fast=True,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()

Example

messages = [
    {"role": "user", "content": "Explain what a neural network is in exactly 3 simple sentences."}
]

text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inp = tok(text, return_tensors="pt").to(next(model.parameters()).device)

with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=120,
        do_sample=False,
        repetition_penalty=1.08,
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))

Requirements

pip install torch transformers accelerate safetensors

bitsandbytes is not required at runtime for this final exported variant.

Important Notes

trust_remote_code=True is required.
The custom runtime uses a NanoTiedHead implementation that ties output logits to the quantized embedding table without registering the embedding module twice.
Custom linear layers use chunked forward paths to keep peak VRAM under control.
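
To make the last two notes concrete, here is a toy sketch in pure Python of a single quantized table that serves both as embedding lookup and as output head, with logits computed in chunks so that only a few dequantized rows exist at any time. It is not the NanoTiedHead implementation: the class name `TiedQuantTable`, the per-row scale scheme, and the chunk size are all assumptions chosen for illustration, and a real runtime would operate on torch tensors.

```python
class TiedQuantTable:
    """One quantized table shared by the embedding lookup and the output head."""

    def __init__(self, q_rows, scales):
        self.q_rows = q_rows  # integer codes, one row per vocabulary entry
        self.scales = scales  # one float scale per row (assumed scheme)

    def _dequant(self, start, stop):
        """Dequantize rows [start, stop) only."""
        rows = self.q_rows[start:stop]
        scales = self.scales[start:stop]
        return [[q * s for q in row] for row, s in zip(rows, scales)]

    def embed(self, token_id):
        """Embedding lookup: dequantize a single row."""
        return self._dequant(token_id, token_id + 1)[0]

    def logits(self, hidden, chunk_size=2):
        """Output head: dot product of hidden with every row, chunk by chunk."""
        out = []
        for start in range(0, len(self.q_rows), chunk_size):
            # Only chunk_size dequantized rows are alive inside this loop body.
            for w_row in self._dequant(start, start + chunk_size):
                out.append(sum(h * w for h, w in zip(hidden, w_row)))
        return out


table = TiedQuantTable(q_rows=[[2, 0], [0, 2], [1, 1]], scales=[0.5, 0.5, 1.0])
print(table.embed(2))
print(table.logits([3.0, 4.0]))
```

Because both methods read the same `q_rows`, the table is stored once, and the chunk size trades peak memory against the number of dequantization passes without changing the result.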

Limitations

Validation was narrow and engineering-driven, not a full benchmark suite.
This release is specifically tuned around Qwen/Qwen2.5-3B-Instruct.
It should be treated as a compact experimental runtime artifact, not as a general scientific proof of broader architectural claims.

License Note

This release should be described as a dual-license or dual-layer research distribution.

Built with Qwen.

The intended reading is:

Qwen-derived materials retain the relevant Qwen Research License obligations
Nano-specific runtime, packaging, and documentation changes are additional repository-authored research components

For redistributed copies, keep:

LICENSE
NOTICE
clear indication of modified files where applicable

The Hugging Face metadata stays at license: other because this arrangement is not accurately described by a single SPDX identifier.


Safetensors

Model size: 2B params
Tensor types: I64, F16

Model tree for RthItalia/nano_compact_3b_qkvfp16

Base model: Qwen/Qwen2.5-3B
Finetuned: Qwen/Qwen2.5-3B-Instruct
Finetuned: this model

Evaluation results

Self-reported, on the internal 4-prompt smoke suite:

model_size_gb: 2.343
vram_load_gb: 2.343
vram_peak_generate_gb: 2.440
baseline_true_8bit_load_gb: 3.170
baseline_true_8bit_peak_gb: 3.210