Is it mandatory to use flash-attn?

by zhousp666 - opened 29 days ago

Discussion

zhousp666

29 days ago

My GPU is a Tesla T4, and I keep getting the following error:
ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['compute capability not supported']
And also:
Cannot use FA version 2 as it is not supported due to FA2 being only supported on devices with compute capability >= 8
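
For context, the second message is a compute-capability check: FA2 needs capability >= 8, and a Tesla T4 reports 7.5. Below is a minimal sketch of how one might run the same check locally before picking a serving stack. The helper names `supports_fa2` and `detect_capability` are hypothetical, and the threshold is taken only from the error text above, not from vLLM's source.

```python
from typing import Optional, Tuple

# Threshold taken from the error text: FA2 needs compute capability >= 8.
FA2_MIN_MAJOR = 8


def supports_fa2(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the FA2 minimum."""
    major, _minor = capability
    return major >= FA2_MIN_MAJOR


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # optional dependency; only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]}: FA2 supported = {supports_fa2(cap)}")
```

On a Tesla T4, `supports_fa2((7, 5))` is false, which matches the error reported here.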

jianchen0311

Z Lab org 28 days ago

I think vLLM currently restricts the DFlash draft model to the flash attention backend. The SGLang DFlash implementation, however, lets you select between flashinfer, flash attention, and probably also triton, which can be used on a Tesla T4.

zhousp666

28 days ago

Yes, I am loading the model via vLLM, but I did not pass the --attention-backend flash_attn parameter. It still fails with "FLASH_ATTN is not valid for this configuration."

jianchen0311

Z Lab org 28 days ago

Flash attention is currently the only available attention backend for the DFlash draft model in vLLM. There are some PRs that add flashinfer support for DFlash in vLLM, but they haven't been merged. SGLang is probably the only choice on a Tesla T4.
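
As a rough illustration of the SGLang route, the sketch below assembles a launch command that selects a non-flash-attention backend through SGLang's `--attention-backend` server argument. The helper `sglang_launch_command` is hypothetical. The choice of `triton` rests on the "probably" above and is untested on a T4. The DFlash-specific speculative decoding flags are left out because this thread does not state them, so treat this as a starting point rather than a working configuration.

```python
import shlex
from typing import List


def sglang_launch_command(
    model_path: str,
    attention_backend: str = "triton",  # assumption: triton is usable on a T4
    port: int = 30000,
) -> List[str]:
    """Build an argv list for launching an SGLang server with a chosen attention backend."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--attention-backend", attention_backend,
    ]


if __name__ == "__main__":
    cmd = sglang_launch_command("z-lab/Qwen3.5-9B-DFlash")
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list keeps the backend choice in one place, so switching to `flashinfer` on newer hardware is a one-argument change.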
