Instructions to use z-lab/Qwen3.5-9B-DFlash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use z-lab/Qwen3.5-9B-DFlash with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline
    pipe = pipeline("text-generation", model="z-lab/Qwen3.5-9B-DFlash", trust_remote_code=True)

    # Load model directly
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("z-lab/Qwen3.5-9B-DFlash", trust_remote_code=True)
    model = AutoModel.from_pretrained("z-lab/Qwen3.5-9B-DFlash", trust_remote_code=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use z-lab/Qwen3.5-9B-DFlash with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "z-lab/Qwen3.5-9B-DFlash"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "z-lab/Qwen3.5-9B-DFlash",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker
docker model run hf.co/z-lab/Qwen3.5-9B-DFlash
- SGLang
How to use z-lab/Qwen3.5-9B-DFlash with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "z-lab/Qwen3.5-9B-DFlash" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "z-lab/Qwen3.5-9B-DFlash",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "z-lab/Qwen3.5-9B-DFlash" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "z-lab/Qwen3.5-9B-DFlash",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
- Docker Model Runner
How to use z-lab/Qwen3.5-9B-DFlash with Docker Model Runner:
docker model run hf.co/z-lab/Qwen3.5-9B-DFlash
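The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM or SGLang, and the base URL assumes a server started as shown above (port 8000 for vLLM, port 30000 for SGLang).

```python
import json
import urllib.request

MODEL_ID = "z-lab/Qwen3.5-9B-DFlash"


def build_completion_request(base_url, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint.

    Hypothetical helper: it mirrors the JSON body of the curl example.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server):
#   print(complete("http://localhost:8000", "Once upon a time,"))
```

Building the request separately from sending it makes it possible to inspect the URL and payload without a running server.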
Is it mandatory to use flash-attn?
My GPU is a Tesla T4, and I keep getting the following error:
ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['compute capability not supported']
And also:
Cannot use FA version 2 as it is not supported due to FA2 being only supported on devices with compute capability >= 8
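The second message explains the first: FA2 needs compute capability 8.0 or higher, and the Tesla T4 is a compute capability 7.5 device. A minimal sketch of that check follows; the helper names are illustrative, the threshold is taken from the error message above, and `local_gpu_capability` assumes PyTorch is installed with a visible CUDA device.

```python
# Minimum compute capability quoted in the error message for FA2.
FA2_MIN_CAPABILITY = (8, 0)


def supports_flash_attn_2(capability):
    """Return True if a (major, minor) compute capability meets the FA2 minimum."""
    return tuple(capability) >= FA2_MIN_CAPABILITY


def local_gpu_capability(index=0):
    """Return (major, minor) for a local GPU, or None if it cannot be queried.

    Assumes PyTorch; torch.cuda.get_device_capability returns a (major, minor) tuple.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(index)


# Tesla T4 is compute capability 7.5; A100 is 8.0.
for name, capability in [("Tesla T4", (7, 5)), ("A100", (8, 0))]:
    status = "meets" if supports_flash_attn_2(capability) else "is below"
    print(f"{name} {capability} {status} the FA2 minimum {FA2_MIN_CAPABILITY}")
```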
I think vLLM currently restricts the DFlash draft model to the flash attention backend. The SGLang DFlash implementation, however, lets you choose between flashinfer, flash attention, and probably also triton, which can be used on a Tesla T4.
Yes, I am loading the model via vLLM, but I did not pass the --attention-backend flash_attn parameter. It still fails with "FLASH_ATTN is not valid for this configuration."
Flash attention is currently the only available attention backend for the DFlash draft model in vLLM. There are some PRs that add flashinfer support for DFlash in vLLM, but they haven't been merged. SGLang is probably the only choice on a Tesla T4.
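For anyone trying the SGLang route on a T4, here is a rough sketch of how the launch command from the model page could be extended with an explicit attention backend. It assumes SGLang's `--attention-backend` server argument with the value `triton`; whether the DFlash draft model actually runs with that backend on a T4 is untested here, and any additional speculative-decoding flags that DFlash may need are not shown. The helper name `sglang_launch_args` is illustrative.

```python
import shlex

MODEL_ID = "z-lab/Qwen3.5-9B-DFlash"


def sglang_launch_args(attention_backend="triton", host="0.0.0.0", port=30000):
    """Build the argv list for an SGLang server with an explicit attention backend.

    Assumption: the chosen backend works with DFlash on the target GPU.
    """
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_ID,
        "--host", host,
        "--port", str(port),
        "--attention-backend", attention_backend,
    ]


# Print the command for copy-pasting into a shell.
print(shlex.join(sglang_launch_args()))

# To launch directly from Python instead:
#   import subprocess
#   subprocess.run(sglang_launch_args(), check=True)
```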