Qwen3-Embedding-0.6B -- GGUF
All-in-one GGUF quantizations of Qwen/Qwen3-Embedding-0.6B, from 8-bit down to 1-bit, with importance-matrix calibration optimized for financial and technical text retrieval.
Qwen3-Embedding-0.6B is a compact, multilingual embedding model well suited for RAG pipelines, semantic search, and document retrieval. These quantizations make it practical to run on edge devices, laptops, and resource-constrained servers -- particularly for financial NLP workloads where low latency and small memory footprint matter.
The importance matrix was calibrated on a mixed corpus weighted toward financial data (financial Q&A from FiQA, SEC 10-K filings from FinanceBench, financial sentiment from Twitter, RAG pairs) alongside math reasoning and general text, so the quantized models preserve the weights most relevant to financial domain embeddings.
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Embedding-0.6B |
| Parameters | 595,776,512 |
| Max context | 32,768 tokens |
| Pooling | Last token |
| Embedding dim | 1024 |
| License | Apache 2.0 |
| Quantized with | llama.cpp |
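To make the "Last token" pooling entry concrete, here is a minimal sketch of what last-token pooling does to a sequence of per-token hidden states. The function name and the L2 normalization step are assumptions for illustration, not something this README specifies; the real model works on 1024-dimensional vectors inside llama.cpp.

```python
import math

def last_token_pool(hidden_states, normalize=True):
    """Use the hidden state of the final token as the sequence embedding."""
    if not hidden_states:
        raise ValueError("need at least one token")
    vec = list(hidden_states[-1])
    if normalize:
        # L2 normalization is an assumption here; it makes dot product equal cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
    return vec

# Toy input: 3 tokens with 2-dimensional hidden states (the real model uses 1024).
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
embedding = last_token_pool(tokens)
print(embedding)
```

Only the final row contributes to the result, which is why the EOS handling noted under the technical details matters for this model.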
Why quantize?
Despite their size, neural networks are remarkably sparse in information density. Most of the 16 bits allocated per weight during training exist to make gradient descent work -- not to store knowledge. Current estimates put the actual information content at roughly 2 bits per parameter. The remaining 14 bits are redundancy.
This explains why aggressive quantization works: compressing from 16-bit to 4-bit (75% reduction) discards almost exclusively noise. Our benchmark data is consistent with this -- Q3_K_M-imat at 4.66 BPW scores within +0.62 PPL of the full BF16 baseline while being 70% smaller than BF16 (331 MB vs 1.1 GB). For comparison, Q4_K_M-imat at 5.32 BPW shows +0.65 PPL, so the 3-bit model matches it to within measurement noise. The imatrix (importance matrix) is key here: by profiling which weights actually carry signal, we preserve those at higher precision and compress the rest. This is why an imatrix-calibrated 3-bit model can keep pace with a larger 4-bit one.
The quality cliff appears just under 4 BPW, where quantization starts cutting into real information. Below that point, PPL degrades rapidly (Q2_K at 3.97 BPW: +391; IQ1_S at 2.79 BPW: +23,008). Ternary quantizations (TQ2_0, TQ1_0) diverge entirely on this architecture.
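The BPW figures quoted above can be reproduced from file size and parameter count. The sketch below assumes the sizes quoted as "MB" are mebibytes (2**20 bytes); that interpretation is an assumption, chosen because it matches the quoted 4.66 and 5.32 BPW values. The helper names are illustrative.

```python
N_PARAMS = 595_776_512  # parameter count from the property table

def bits_per_weight(size_mib, n_params=N_PARAMS):
    """Average bits stored per parameter, assuming size is given in MiB."""
    return size_mib * 2**20 * 8 / n_params

def reduction_vs_bf16(bpw):
    """Fractional size reduction relative to 16 bits per weight."""
    return 1 - bpw / 16

for name, size in [("Q3_K_M-imat", 331), ("Q4_K_M-imat", 378)]:
    bpw = bits_per_weight(size)
    print(f"{name}: {bpw:.2f} BPW, {reduction_vs_bf16(bpw):.0%} smaller than 16-bit")
```

The effective BPW sits above the nominal bit width because llama.cpp quantization types mix precisions across tensors, which is more visible on a small model.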
Benchmark results
All models evaluated with llama-perplexity on a 22 MB calibration corpus (financial, math, and general text). Context window: 1536 tokens. Chunks: 200. Lower PPL = better.
Baseline PPL (BF16): 406.0250
Notes:
- The -imat suffix means the model was quantized with importance-matrix calibration. This is what allows 3-4 bit models to stay close to baseline -- the imatrix tells the quantizer which weights carry real information.
- Q3_K_S-imat reports an anomalously low PPL (340). This is a statistical artifact, not a genuine improvement over baseline.
- TQ2_0 / TQ1_0 (ternary quantizations) produce diverged PPL on this architecture. They require CPU or CUDA (not supported on Apple Metal) and are not usable for this model.
- Below ~4 BPW, quality degrades steeply. Below ~3 BPW, models are not recommended for any production use.
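For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, and the deltas quoted in this README are differences from the BF16 baseline. The sketch below shows only that definition; it does not reproduce how llama-perplexity chunks or batches the corpus, and the function names are illustrative.

```python
import math

BASELINE_PPL = 406.0250  # BF16 baseline reported above

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability over evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ppl_delta(quantized_ppl, baseline_ppl=BASELINE_PPL):
    """Difference reported as '+X PPL' for a quantized model."""
    return quantized_ppl - baseline_ppl

# A model that assigns probability 0.25 to every token has perplexity of about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Because the metric is exponential in the average log-loss, small per-token losses in accuracy show up as large PPL jumps at the low-bit end.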
Choosing a model
| Use case | Model | Notes |
|---|---|---|
| Maximum quality | Q8_0 | Near-lossless, 610 MB |
| Best quality/size trade-off | Q3_K_M-imat | +0.62 PPL delta at 331 MB -- smallest model with near-baseline quality |
| Larger but safe margin | Q4_K_M-imat | +0.65 PPL delta at 378 MB |
| Extreme compression | Q2_K-imat | Usable for non-critical applications |
Quantization method
All models were quantized from the BF16 source using llama-quantize from llama.cpp.
Three strategies were used:
- Standard (Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q5_0, Q5_1) -- uniform precision reduction, no imatrix.
- K-Quant + imatrix (Q4_K_M, Q4_K_S, Q4_0, Q4_1, Q3_K_L, Q3_K_M, Q3_K_S, Q2_K, Q2_K_S) -- block-level mixed precision, importance matrix recommended.
- Importance-weighted (IQ4_NL, IQ4_XS, IQ3_M, IQ3_S, IQ3_XS, IQ3_XXS, IQ2_M, IQ2_S, IQ2_XS, IQ2_XXS, IQ1_M, IQ1_S, TQ2_0, TQ1_0) -- non-linear quantization, imatrix required.
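The exact commands used for this repository are not given here, so the following is only a sketch of how the two llama.cpp tools are typically combined: `llama-imatrix` produces an importance matrix from a text corpus, and `llama-quantize` optionally consumes it via `--imatrix`. The corpus, imatrix, and Q8_0 output file names are hypothetical placeholders.

```python
import shlex

def imatrix_cmd(model, corpus, out="imatrix.dat"):
    """Build the llama-imatrix command line (file names are placeholders)."""
    return ["llama-imatrix", "-m", model, "-f", corpus, "-o", out]

def quantize_cmd(src, dst, qtype, imatrix=None):
    """Build the llama-quantize command line, with or without an imatrix."""
    cmd = ["llama-quantize"]
    if imatrix:
        cmd += ["--imatrix", imatrix]
    return cmd + [src, dst, qtype]

src = "Qwen3-Embedding-0.6B-BF16.gguf"
print(shlex.join(imatrix_cmd(src, "calibration.txt")))
print(shlex.join(quantize_cmd(src, "Qwen3-Embedding-0.6B-Q8_0.gguf", "Q8_0")))
print(shlex.join(quantize_cmd(
    src, "Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf", "Q3_K_M", imatrix="imatrix.dat")))
```

The commands are only printed, not executed; pass the lists to `subprocess.run` once the paths point at real files.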
Calibration corpus
The importance matrix was generated from a mixed-domain corpus (22 MB, ~198,000 lines). The mix was chosen to cover the primary use case (financial text) while including general and mathematical text to maintain broad capability:
| Dataset | Source | Domain | Entries |
|---|---|---|---|
| WikiText-2 | ggml-org | General knowledge | 36,718 lines |
| Twitter Financial News | zeroshot/twitter-financial-news-sentiment | Financial sentiment | 9,543 |
| GSM8K | openai/gsm8k | Math word problems | 7,473 |
| Financial RAG | philschmid/finanical-rag-embedding-dataset | Financial Q&A pairs | 6,998 |
| FiQA | explodinggradients/fiqa | Personal finance Q&A | 5,650 |
| MATH Competition | DigitalLearningGmbH/MATH-lighteval | Competition math | 5,000 |
| FinanceBench | PatronusAI/financebench | SEC 10-K filings | 150 |
The financial datasets (FiQA + Twitter Financial News + Financial RAG + FinanceBench) contribute ~22,300 entries of domain-specific text covering sentiment, Q&A, RAG pairs, and SEC filings -- ensuring the importance matrix prioritizes weights relevant to financial terminology and reasoning.
Usage
llama.cpp server
./llama-server \
  -m Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf \
  --embedding --pooling last \
  -c 32768 -np 8 \
  --host 0.0.0.0 --port 8080
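Once the server is running, embeddings can be requested over its OpenAI-compatible `/v1/embeddings` endpoint. The client below is a minimal sketch using only the standard library; the helper names are illustrative, and the base URL assumes the server is reachable on localhost at the port shown above.

```python
import json
import urllib.request

def build_embedding_request(texts, base_url="http://localhost:8080"):
    """Create a POST request for the OpenAI-compatible embeddings endpoint."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(response_body):
    """Return embedding vectors ordered by their 'index' field."""
    data = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]

def embed(texts, base_url="http://localhost:8080"):
    """Send texts to the server and return one vector per text."""
    with urllib.request.urlopen(build_embedding_request(texts, base_url)) as resp:
        return parse_embeddings(resp.read())

# Requires the server above to be running:
# vectors = embed(["Financial analysis of Q3 earnings"])
# print(len(vectors[0]))
```

Splitting request building from parsing keeps the network call in one place, so the rest can be checked without a live server.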
Python (llama-cpp-python)
import llama_cpp
from llama_cpp import Llama

model = Llama(
    model_path="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
    embedding=True,
    # pooling_type expects an integer constant, not a string; LAST is pooling type 3
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST,
    n_ctx=32768,
)

result = model.create_embedding(["Financial analysis of Q3 earnings"])
print(len(result["data"][0]["embedding"]))  # 1024
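For retrieval, the vectors returned by `create_embedding` are usually compared with cosine similarity. The sketch below ranks documents against a query using toy 2-dimensional vectors in place of real model output; the function names are illustrative and not part of llama-cpp-python.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy vectors standing in for 1024-dimensional embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print([i for i, _ in rank(query, docs)])  # [2, 1, 0]
```

With the model loaded as above, each vector would come from `result["data"][i]["embedding"]`.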
Download a specific file
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="PeterAM4/Qwen3-Embedding-0.6B-GGUF",
    filename="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
)
Technical details
- GGUF format v3
- Tokenizer: Qwen3 (151,936 tokens)
- add_eos_token: false (patched for llama.cpp compatibility; EOS token ID 151643 is still present and usable)
- Pooling type: 3 (last token)
- Hardware used: Apple M3 Pro with Metal acceleration (CPU fallback for ternary quants)
Credits
- Qwen/Qwen3-Embedding-0.6B by Alibaba Qwen Team
- llama.cpp by Georgi Gerganov et al.
License
Apache 2.0, inherited from the original model.