Qwen3-VL-Embedding-2B-FP8

This is an FP8-quantized version of Qwen/Qwen3-VL-Embedding-2B.

Quantization Details

Component              Precision   Notes
Vision Encoder (ViT)   BF16        Preserved for accuracy
LLM Decoder Layers     FP8         Quantized for efficiency
Embeddings             BF16        Preserved
  • Scheme: FP8_DYNAMIC
    • Weights: FP8_E4M3 (per-channel quantization)
    • Activations: Dynamic per-token quantization at runtime
  • Tool: llm-compressor
  • Calibration: None required (data-free quantization)
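The scheme above can be illustrated with a small numerical sketch: weight scales are derived per output channel from each channel's absolute maximum, and activation scales are computed the same way per token (per row) at runtime. This is a simplified round-to-nearest simulation, not real FP8 arithmetic (E4M3 uses a floating-point grid rather than uniform steps):

```python
import numpy as np

F8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quant_dequant(x, axis):
    """Simulate symmetric quantize-dequantize along `axis`
    (scale handling only; mantissa rounding of real FP8 is ignored)."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = amax / F8_E4M3_MAX
    q = np.clip(np.round(x / scale), -F8_E4M3_MAX, F8_E4M3_MAX)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # weights: one scale per output channel (row)
X = rng.standard_normal((3, 8))  # activations: one scale per token (row), at runtime

W_q = quant_dequant(W, axis=1)
X_q = quant_dequant(X, axis=1)
print(float(np.abs(W - W_q).max()))  # reconstruction error, small vs. weight magnitude
```

Because the activation scales depend only on the current batch, no calibration data is needed, which is why the recipe is data-free.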

Creation

This model was quantized using llm-compressor:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep vision encoder in BF16
    ]
)
# Apply quantization
oneshot(model=model, recipe=recipe)
# Save
model.save_pretrained("Qwen3-VL-Embedding-2B-FP8", save_compressed=True)
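The `re:` prefix in the `ignore` list marks a regex pattern rather than a literal module name. A small sketch with hypothetical module names (illustrative only, not the model's exact module paths) shows which layers the recipe would skip:

```python
import re

# The recipe's regex: any module under the vision encoder stays in BF16.
pattern = re.compile(r"model\.visual\..*")

# Hypothetical module names, for illustration.
modules = [
    "model.visual.blocks.0.attn.qkv",               # vision encoder -> kept in BF16
    "model.language_model.layers.0.mlp.gate_proj",  # LLM linear -> quantized to FP8
    "lm_head",                                      # explicitly ignored by name
]
for name in modules:
    skipped = name == "lm_head" or bool(pattern.fullmatch(name))
    print(name, "->", "ignored" if skipped else "quantized")
```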

Usage

  • Requirements
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
llmcompressor==0.9.0.2
  • Basic Example
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder
import numpy as np
import torch

# Define a list of query texts
queries = [
    {"text": "Visible embers scatter across the ground."},      # Fire prompt
    {"text": "Routine scene with no disturbances."},            # Normal prompt
]

# Define a list of documents (texts, images, videos)
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"video": "video.mp4"}
]

# Initialize the Qwen3VLEmbedder model
model_name_or_path = "PIA-SPACE-LAB/Qwen3-VL-Embedding-2B-FP8"
model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path, max_frames=8, fps=8)

# Combine queries and documents into a single input list
inputs = queries + documents

# Process the inputs to get embeddings
embeddings = model.process(inputs)

# Compute similarity scores between query embeddings (first 2 rows)
# and document embeddings (remaining 3 rows)
similarity_scores = (embeddings[:2] @ embeddings[2:].T)

# Print out the similarity scores in a list format
print(similarity_scores.tolist())

For more usage examples, please visit our GitHub repository.
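Assuming the embeddings are L2-normalized (typical for embedding models, so dot products behave like cosine similarities), the score matrix can be used directly to rank documents per query. A minimal sketch with toy scores in place of the real matrix:

```python
import numpy as np

# Toy stand-in for the (num_queries x num_documents) similarity matrix
# produced by the query/document dot product above.
similarity_scores = np.array([
    [0.61, 0.22, 0.35],
    [0.18, 0.57, 0.40],
])

# Rank documents for each query, highest score first.
ranking = np.argsort(-similarity_scores, axis=1)
print(ranking.tolist())  # -> [[0, 2, 1], [1, 2, 0]]
```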
