Instructions to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with libraries and local apps.

Libraries

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PeterAM4/Qwen3-Embedding-0.6B-GGUF")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

llama-cpp-python

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

# This is an embedding model, so load it in embedding mode
# instead of calling the chat-completion API.
llm = Llama.from_pretrained(
	repo_id="PeterAM4/Qwen3-Embedding-0.6B-GGUF",
	filename="Qwen3-Embedding-0.6B-BF16.gguf",
	embedding=True,
)

sentences = [
	"That is a happy person",
	"That is a happy dog",
	"That is a very happy person",
	"Today is a sunny day",
]
result = llm.create_embedding(sentences)
print(len(result["data"]))  # one embedding entry per input sentence

Local Apps

llama.cpp

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Ollama
How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Ollama:
```
ollama run hf.co/PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
```

Unsloth Studio

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for PeterAM4/Qwen3-Embedding-0.6B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for PeterAM4/Qwen3-Embedding-0.6B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for PeterAM4/Qwen3-Embedding-0.6B-GGUF to start chatting

Pi

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Docker Model Runner:
```
docker model run hf.co/PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M
```

Lemonade

How to use PeterAM4/Qwen3-Embedding-0.6B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull PeterAM4/Qwen3-Embedding-0.6B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-Embedding-0.6B-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3-Embedding-0.6B -- GGUF

All-in-one GGUF quantizations of Qwen/Qwen3-Embedding-0.6B, from 8-bit down to 1-bit, with importance-matrix calibration optimized for financial and technical text retrieval.

Qwen3-Embedding-0.6B is a compact, multilingual embedding model well suited for RAG pipelines, semantic search, and document retrieval. These quantizations make it practical to run on edge devices, laptops, and resource-constrained servers -- particularly for financial NLP workloads where low latency and small memory footprint matter.

The importance matrix was calibrated on a mixed corpus weighted toward financial data (financial Q&A from FiQA, SEC 10-K filings from FinanceBench, financial sentiment from Twitter, RAG pairs) alongside math reasoning and general text, so the quantized models preserve the weights most relevant to financial domain embeddings.

Property	Value
Base model	Qwen/Qwen3-Embedding-0.6B
Parameters	595,776,512
Max context	32,768 tokens
Pooling	Last token
Embedding dim	1024
License	Apache 2.0
Quantized with	llama.cpp

Why quantize?

Despite their size, neural networks are remarkably sparse in information density. Most of the 16 bits allocated per weight during training exist to make gradient descent work -- not to store knowledge. Current estimates put the actual information content at roughly 2 bits per parameter. The remaining 14 bits are redundancy.

This explains why aggressive quantization works: compressing from 16-bit to 4-bit (a 75% reduction) discards mostly noise. Our benchmark data is consistent with this -- Q3_K_M-imat at 4.66 BPW scores within 0.62 PPL of the full BF16 baseline while being 70% smaller (331 MB vs 1.1 GB). For comparison, Q4_K_M-imat at 5.32 BPW shows +0.65 PPL, so the 3-bit model marginally edges it out. The imatrix (importance matrix) is key here: by profiling which weights actually carry signal, the quantizer preserves those at higher precision and compresses the rest. This is why imatrix-calibrated 3-bit models can outperform 5-bit quantizations made without an imatrix (Q5_K_M: +36.92 PPL).

The quality cliff appears around 4 BPW, just below the 3-bit K-quants, where we start cutting into real information. Below that, PPL degrades rapidly (Q2_K at 3.97 BPW: +391, IQ1_S at 2.79 BPW: +23,008). Ternary quantizations (TQ2_0, TQ1_0) diverge entirely on this architecture.
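
The file sizes quoted above follow almost directly from bits per weight. A minimal sketch, assuming the listed sizes are in MiB and ignoring metadata overhead (both are assumptions, not statements from the card); the function names are illustrative:

```python
# Parameter count and BPW values are taken from the tables in this card.
N_PARAMS = 595_776_512
BF16_BPW = 16.08


def estimated_size_mib(bits_per_weight: float, n_params: int = N_PARAMS) -> float:
    """Approximate file size in MiB: parameters x bits per weight / 8 bytes."""
    return n_params * bits_per_weight / 8 / 2**20


def reduction_vs_bf16(bits_per_weight: float) -> float:
    """Fractional size reduction relative to the BF16 file."""
    return 1 - bits_per_weight / BF16_BPW


for name, bpw in [("BF16", 16.08), ("Q4_K_M-imat", 5.32), ("Q3_K_M-imat", 4.66)]:
    print(f"{name:12s} ~{estimated_size_mib(bpw):5.0f} MiB, "
          f"{reduction_vs_bf16(bpw):.0%} smaller than BF16")
```

Under these assumptions the estimate for Q3_K_M-imat lands on the 331M listed in the benchmark table, which suggests BPW here is simply file size divided by parameter count.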

Benchmark results

All models were evaluated with llama-perplexity on a 22 MB calibration corpus (financial, math, and general text). Context window: 1536 tokens. Chunks: 200. Lower PPL = better.

Baseline PPL (BF16): 406.0250
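
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that textbook definition; it does not reproduce how llama-perplexity chunks the corpus, and the function name is illustrative:

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 1/4 to every token has perplexity 4
# (up to floating-point rounding).
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```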

Model	Size	BPW	PPL	Delta PPL
Qwen3-Embedding-0.6B-BF16.gguf (unquantized baseline)	1.1G	16.08	406.0250	--
Qwen3-Embedding-0.6B-Q8_0.gguf	610M	8.58	409.5689	+3.54
Qwen3-Embedding-0.6B-Q6_K.gguf	472M	6.65	417.3712	+11.35
Qwen3-Embedding-0.6B-Q5_1.gguf	442M	6.23	426.9407	+20.92
Qwen3-Embedding-0.6B-Q5_K_M.gguf	424M	5.96	442.9431	+36.92
Qwen3-Embedding-0.6B-Q5_0.gguf	416M	5.86	413.1916	+7.17
Qwen3-Embedding-0.6B-Q5_K_S.gguf	416M	5.86	414.9329	+8.91
Qwen3-Embedding-0.6B-Q4_1-imat.gguf	390M	5.49	403.0646	-2.96
Qwen3-Embedding-0.6B-Q4_K_M-imat.gguf	378M	5.32	406.6788	+0.65
Qwen3-Embedding-0.6B-Q4_K_S-imat.gguf	365M	5.14	406.9947	+0.97
Qwen3-Embedding-0.6B-Q4_0-imat.gguf	364M	5.13	419.8843	+13.86
Qwen3-Embedding-0.6B-IQ4_NL-imat.gguf	364M	5.12	435.0203	+29.00
Qwen3-Embedding-0.6B-Q3_K_L-imat.gguf	351M	4.94	412.0217	+6.00
Qwen3-Embedding-0.6B-IQ4_XS-imat.gguf	351M	4.94	451.4025	+45.38
Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf (recommended)	331M	4.66	406.6408	+0.62
Qwen3-Embedding-0.6B-IQ3_M-imat.gguf	320M	4.51	460.9405	+54.92
Qwen3-Embedding-0.6B-IQ3_S-imat.gguf	308M	4.34	475.4797	+69.45
Qwen3-Embedding-0.6B-Q3_K_S-imat.gguf	308M	4.34	340.2907	-65.73
Qwen3-Embedding-0.6B-IQ3_XS-imat.gguf	298M	4.20	520.3907	+114.37
Qwen3-Embedding-0.6B-Q2_K-imat.gguf	282M	3.97	797.8549	+391.83
Qwen3-Embedding-0.6B-Q2_K_S-imat.gguf	267M	3.76	1561.2449	+1155.22
Qwen3-Embedding-0.6B-IQ3_XXS-imat.gguf	266M	3.74	613.9329	+207.91
Qwen3-Embedding-0.6B-IQ2_M-imat.gguf	252M	3.55	1283.4407	+877.42
Qwen3-Embedding-0.6B-IQ2_S-imat.gguf	242M	3.41	1857.4142	+1451.39
Qwen3-Embedding-0.6B-TQ2_0-imat.gguf	236M	3.32	diverged	N/A
Qwen3-Embedding-0.6B-IQ2_XS-imat.gguf	231M	3.25	3632.9250	+3226.90
Qwen3-Embedding-0.6B-IQ2_XXS-imat.gguf	219M	3.08	5641.8950	+5235.87
Qwen3-Embedding-0.6B-TQ1_0-imat.gguf	216M	3.04	diverged	N/A
Qwen3-Embedding-0.6B-IQ1_M-imat.gguf	206M	2.90	7495.4178	+7089.39
Qwen3-Embedding-0.6B-IQ1_S-imat.gguf	198M	2.79	23414.9432	+23008.92

Notes:

The -imat suffix means the model was quantized with importance-matrix calibration. This is what allows 3-4 bit models to stay close to baseline -- the imatrix tells the quantizer which weights carry real information.
Q3_K_S-imat reports an anomalously low PPL (340). This is a statistical artifact, not a genuine improvement over baseline.
TQ2_0 / TQ1_0 (ternary quantizations) produce diverged PPL on this architecture. They require CPU or CUDA (not supported on Apple Metal) and are not usable for this model.
Below ~4 BPW, quality degrades steeply. Below ~3 BPW, models are not recommended for any production use.
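
The Delta PPL column is the difference from the BF16 baseline. As a small sketch, the snippet below recomputes it for a handful of rows copied from the table and adds the relative change, which is not in the table and is shown only for illustration:

```python
BASELINE_PPL = 406.0250  # BF16 row

# A few rows copied from the table: (name, size as listed, BPW, PPL).
ROWS = [
    ("Q8_0", 610, 8.58, 409.5689),
    ("Q4_K_M-imat", 378, 5.32, 406.6788),
    ("Q3_K_M-imat", 331, 4.66, 406.6408),
    ("Q3_K_S-imat", 308, 4.34, 340.2907),
    ("Q2_K-imat", 282, 3.97, 797.8549),
]


def delta_ppl(ppl, baseline=BASELINE_PPL):
    """Absolute PPL difference from the baseline (positive = worse)."""
    return ppl - baseline


def relative_change(ppl, baseline=BASELINE_PPL):
    """PPL difference as a fraction of the baseline."""
    return delta_ppl(ppl, baseline) / baseline


for name, size, bpw, ppl in ROWS:
    print(f"{name:12s} {size:4d}M  {bpw:5.2f} BPW  "
          f"delta {delta_ppl(ppl):+8.2f}  ({relative_change(ppl):+.1%})")
```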

Choosing a model

Use case	Model	Notes
Maximum quality	Q8_0	Near-lossless, 610 MB
Best quality/size trade-off	Q3_K_M-imat	+0.62 PPL delta at 331 MB -- smallest model with near-baseline quality
Larger but safe margin	Q4_K_M-imat	+0.65 PPL delta at 378 MB
Extreme compression	Q2_K-imat	Usable for non-critical applications
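
One way to turn the table into a selection rule is to pick the smallest file whose PPL stays within a chosen tolerance of baseline. This is a hypothetical helper, not part of any tool; the candidate list is a subset of the benchmark table and the tolerance is an arbitrary example:

```python
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    name: str
    size_mb: int
    delta_ppl: float


# Subset of the benchmark table above.
CANDIDATES = [
    Quant("Q8_0", 610, 3.54),
    Quant("Q4_1-imat", 390, -2.96),
    Quant("Q4_K_M-imat", 378, 0.65),
    Quant("Q4_K_S-imat", 365, 0.97),
    Quant("Q3_K_M-imat", 331, 0.62),
    Quant("Q3_K_S-imat", 308, -65.73),
    Quant("Q2_K-imat", 282, 391.83),
]


def smallest_within(candidates, max_abs_delta: float) -> Optional[Quant]:
    """Smallest file whose PPL stays within max_abs_delta of baseline.

    abs() also rejects large negative deltas, which the benchmark notes
    describe as statistical artifacts rather than real improvements.
    """
    ok = [q for q in candidates if abs(q.delta_ppl) <= max_abs_delta]
    return min(ok, key=lambda q: q.size_mb) if ok else None


print(smallest_within(CANDIDATES, max_abs_delta=1.0))
```

With a tolerance of 1.0 PPL this picks Q3_K_M-imat, matching the recommendation in the table.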

Quantization method

All models were quantized from the BF16 source using llama-quantize from llama.cpp.

Three strategies were used:

No imatrix (Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q5_0, Q5_1) -- legacy and K-quant types at 5 bits and above, quantized without an importance matrix.
Legacy and K-quant types + imatrix (Q4_K_M, Q4_K_S, Q4_0, Q4_1, Q3_K_L, Q3_K_M, Q3_K_S, Q2_K, Q2_K_S) -- block-level quantization at 4 bits and below (mixed precision for the K-quants), importance matrix recommended.
I-quants and ternary types + imatrix (IQ4_NL, IQ4_XS, IQ3_M, IQ3_S, IQ3_XS, IQ3_XXS, IQ2_M, IQ2_S, IQ2_XS, IQ2_XXS, IQ1_M, IQ1_S, TQ2_0, TQ1_0) -- non-linear (IQ) and ternary (TQ) quantization, imatrix required.
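
The card does not give the exact commands used, so the following is only a sketch of the usual two-step llama.cpp workflow: collect an importance matrix with llama-imatrix, then pass it to llama-quantize. The file names calibration.txt and imatrix.dat are placeholders, and the helper functions are illustrative:

```python
import shlex
from typing import List, Optional


def imatrix_command(model: str, corpus: str, out: str = "imatrix.dat") -> List[str]:
    """Command that collects activation statistics over a calibration text file."""
    return ["llama-imatrix", "-m", model, "-f", corpus, "-o", out]


def quantize_command(src: str, dst: str, qtype: str,
                     imatrix: Optional[str] = None) -> List[str]:
    """Command that quantizes src to dst; the imatrix argument is optional."""
    cmd = ["llama-quantize"]
    if imatrix:
        cmd += ["--imatrix", imatrix]
    return cmd + [src, dst, qtype]


src = "Qwen3-Embedding-0.6B-BF16.gguf"
print(shlex.join(imatrix_command(src, "calibration.txt")))
print(shlex.join(quantize_command(src, "Qwen3-Embedding-0.6B-Q8_0.gguf", "Q8_0")))
print(shlex.join(quantize_command(
    src, "Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf", "Q3_K_M", imatrix="imatrix.dat")))
```

The functions only build argument lists; pass one to subprocess.run to execute it once the llama.cpp binaries are on PATH.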

Calibration corpus

The importance matrix was generated from a mixed-domain corpus (22 MB, ~198,000 lines). The mix was chosen to cover the primary use case (financial text) while including general and mathematical text to maintain broad capability:

Dataset	Source	Domain	Entries
WikiText-2	ggml-org	General knowledge	36,718 lines
Twitter Financial News	zeroshot/twitter-financial-news-sentiment	Financial sentiment	9,543
GSM8K	openai/gsm8k	Math word problems	7,473
Financial RAG	philschmid/finanical-rag-embedding-dataset	Financial Q&A pairs	6,998
FiQA	explodinggradients/fiqa	Personal finance Q&A	5,650
MATH Competition	DigitalLearningGmbH/MATH-lighteval	Competition math	5,000
FinanceBench	PatronusAI/financebench	SEC 10-K filings	150

The financial datasets (FiQA + Twitter Financial News + Financial RAG + FinanceBench) contribute ~22,300 entries of domain-specific text covering sentiment, Q&A, RAG pairs, and SEC filings -- ensuring the importance matrix prioritizes weights relevant to financial terminology and reasoning.
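
The ~22,300 figure can be checked by summing the table. In this short sketch the counts are copied from the table above, while the domain labels (general, financial, math) are a grouping added here for illustration:

```python
# Counts copied from the corpus table; WikiText-2 is listed in lines, not entries.
CORPUS = {
    "WikiText-2": ("general", 36_718),
    "Twitter Financial News": ("financial", 9_543),
    "GSM8K": ("math", 7_473),
    "Financial RAG": ("financial", 6_998),
    "FiQA": ("financial", 5_650),
    "MATH Competition": ("math", 5_000),
    "FinanceBench": ("financial", 150),
}


def total_for(domain: str) -> int:
    """Sum the listed counts for one domain label."""
    return sum(count for label, count in CORPUS.values() if label == domain)


financial = total_for("financial")
math_total = total_for("math")
print(f"financial: {financial:,}  math: {math_total:,}")
```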

Usage

llama.cpp server

./llama-server \
    -m Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf \
    --embedding --pooling last \
    -c 32768 -np 8 \
    --host 0.0.0.0 --port 8080
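
Once the server is running, embeddings can be requested over HTTP. A minimal standard-library client sketch, assuming the server exposes the OpenAI-compatible /v1/embeddings route on the port chosen above; the helper names are illustrative:

```python
import json
import urllib.request


def build_embedding_request(texts, base_url="http://localhost:8080"):
    """Build a POST to the server's OpenAI-compatible /v1/embeddings route."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8080"):
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(build_embedding_request(texts, base_url)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]


# Usage, once the server is running:
# vectors = embed(["Financial analysis of Q3 earnings"])
```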

Python (llama-cpp-python)

import llama_cpp
from llama_cpp import Llama

model = Llama(
    model_path="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
    embedding=True,
    # pooling_type takes the integer enum, not a string; LAST is pooling type 3
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST,
    n_ctx=32768,
)

result = model.create_embedding(["Financial analysis of Q3 earnings"])
print(len(result["data"][0]["embedding"]))  # 1024
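
For semantic search, the returned vectors are typically compared with cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real 1024-dimensional embeddings; the helper names are illustrative and the card does not prescribe a particular similarity measure:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return (index, score) pairs, best match first."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors stand in for embeddings returned by create_embedding.
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
print(rank([1.0, 0.1, 0.0], docs))
```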

Download a specific file

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="PeterAM4/Qwen3-Embedding-0.6B-GGUF",
    filename="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
)

Technical details

GGUF format v3
Tokenizer: Qwen3 (151,936 tokens)
add_eos_token: false (patched for llama.cpp compatibility; EOS token ID 151643 is still present and usable)
Pooling type: 3 (last token)
Hardware used: Apple M3 Pro with Metal acceleration (CPU fallback for ternary quants)
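
The format version can be checked without any GGUF tooling, because a GGUF file starts with a fixed header: a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts. A minimal sketch, assuming a little-endian file; the function name is illustrative:

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte header at the start of a GGUF file."""
    magic, version, n_tensors, n_kv = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


# Synthetic header for illustration; the counts here are made up.
sample = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 10, 5)
print(read_gguf_header(sample))
```

To inspect a real file, read its first 24 bytes in binary mode and pass them to read_gguf_header.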

Credits

Qwen/Qwen3-Embedding-0.6B by Alibaba Qwen Team
llama.cpp by Georgi Gerganov et al.

License

Apache 2.0, inherited from the original model.

Format: GGUF
Model size: 0.6B params
Architecture: qwen3
Hardware compatibility: 1-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for PeterAM4/Qwen3-Embedding-0.6B-GGUF

Base model: Qwen/Qwen3-0.6B-Base
Finetuned: Qwen/Qwen3-Embedding-0.6B
Quantized (232): this model, PeterAM4/Qwen3-Embedding-0.6B-GGUF