Qwen3-VL-Embedding-2B-FP8

This is an FP8-quantized version of Qwen/Qwen3-VL-Embedding-2B.

Quantization Details

Component              Precision   Notes
Vision Encoder (ViT)   BF16        Preserved for accuracy
LLM Decoder Layers     FP8         Quantized for efficiency
Embeddings             BF16        Preserved
  • Scheme: FP8_DYNAMIC
    • Weights: FP8_E4M3 (per-channel quantization)
    • Activations: Dynamic per-token quantization at runtime
  • Tool: llm-compressor
  • Calibration: None required (data-free quantization)
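The scheme above can be illustrated with a small numerical sketch: weight scales are derived per output channel from each channel's absolute maximum, and activation scales are computed the same way per token (per row) at runtime. This is a simplified round-to-nearest simulation, not real FP8 arithmetic (E4M3 uses a floating-point grid rather than uniform steps):

```python
import numpy as np

F8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quant_dequant(x, axis):
    """Simulate symmetric quantize-dequantize along `axis`
    (scale handling only; mantissa rounding of real FP8 is ignored)."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = amax / F8_E4M3_MAX
    q = np.clip(np.round(x / scale), -F8_E4M3_MAX, F8_E4M3_MAX)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # weights: one scale per output channel (row)
X = rng.standard_normal((3, 8))  # activations: one scale per token (row), at runtime

W_q = quant_dequant(W, axis=1)
X_q = quant_dequant(X, axis=1)
print(float(np.abs(W - W_q).max()))  # reconstruction error, small vs. weight magnitude
```

Because the activation scales depend only on the current batch, no calibration data is needed, which is why the recipe is data-free.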

Creation

This model was quantized using llm-compressor:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep vision encoder in BF16
    ]
)
# Apply quantization
oneshot(model=model, recipe=recipe)
# Save
model.save_pretrained("Qwen3-VL-Embedding-2B-FP8", save_compressed=True)
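The `re:` prefix in the `ignore` list marks a regex pattern rather than a literal module name. A small sketch with hypothetical module names (illustrative only, not the model's exact module paths) shows which layers the recipe would skip:

```python
import re

# The recipe's regex: any module under the vision encoder stays in BF16.
pattern = re.compile(r"model\.visual\..*")

# Hypothetical module names, for illustration.
modules = [
    "model.visual.blocks.0.attn.qkv",               # vision encoder -> kept in BF16
    "model.language_model.layers.0.mlp.gate_proj",  # LLM linear -> quantized to FP8
    "lm_head",                                      # explicitly ignored by name
]
for name in modules:
    skipped = name == "lm_head" or bool(pattern.fullmatch(name))
    print(name, "->", "ignored" if skipped else "quantized")
```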

Usage

  • Requirements
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
llmcompressor==0.9.0.2
  • Basic Example
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder
import numpy as np
import torch

# Define a list of query texts
queries = [
    {"text": "Visible embers scatter across the ground."},      # Fire prompt
    {"text": "Routine scene with no disturbances."},            # Normal prompt
]

# Define a list of documents (texts, images, videos)
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"video": "video.mp4"}
]

# Initialize the Qwen3VLEmbedder model
model_name_or_path = "PIA-SPACE-LAB/Qwen3-VL-Embedding-2B-FP8"
model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path, max_frames=8, fps=8)

# Combine queries and documents into a single input list
inputs = queries + documents

# Process the inputs to get embeddings
embeddings = model.process(inputs)

# Compute similarity scores between query embeddings (first 2 rows)
# and document embeddings (remaining 3 rows)
similarity_scores = (embeddings[:2] @ embeddings[2:].T)

# Print out the similarity scores in a list format
print(similarity_scores.tolist())

For more usage examples, please visit our GitHub repository.
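Assuming the embeddings are L2-normalized (typical for embedding models, so dot products behave like cosine similarities), the score matrix can be used directly to rank documents per query. A minimal sketch with toy scores in place of the real matrix:

```python
import numpy as np

# Toy stand-in for the (num_queries x num_documents) similarity matrix
# produced by the query/document dot product above.
similarity_scores = np.array([
    [0.61, 0.22, 0.35],
    [0.18, 0.57, 0.40],
])

# Rank documents for each query, highest score first.
ranking = np.argsort(-similarity_scores, axis=1)
print(ranking.tolist())  # -> [[0, 2, 1], [1, 2, 0]]
```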
