Is it mandatory to use flash-attn?

#3
by zhousp666 - opened

My GPU is a Tesla T4, and I keep getting the following error:
ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['compute capability not supported']
And also:
Cannot use FA version 2 as it is not supported due to FA2 being only supported on devices with compute capability >= 8

I think vLLM currently restricts the DFlash draft model to the flash attention backend. The SGLang DFlash implementation, however, lets you choose between flashinfer, flash attention, and probably also triton, which can be used on a Tesla T4.

Yes, I am loading the model via vLLM, but I did not pass the --attention-backend flash_attn parameter. I still get "FLASH_ATTN is not valid for this configuration."

Flash attention is currently the only attention backend available for the DFlash draft model in vLLM. There are some PRs that add flashinfer support for DFlash in vLLM, but they haven't been merged yet. SGLang is probably the only option on a Tesla T4.
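
For anyone who wants to check their own card before trying this, here is a minimal sketch of the compute capability test implied by the error message. The threshold of 8 is taken from the error text above; the helper names `supports_fa2` and `local_capability` are made up for illustration and are not part of vLLM. A Tesla T4 reports compute capability 7.5, so it falls below the threshold.

```python
from typing import Optional, Tuple

# Threshold taken from the error message: "compute capability >= 8".
FA2_MIN_MAJOR = 8


def supports_fa2(major: int, minor: int = 0) -> bool:
    """Return True if a device with this compute capability meets the FA2 threshold."""
    return major >= FA2_MIN_MAJOR


def local_capability() -> Optional[Tuple[int, int]]:
    """Query the first CUDA device via PyTorch; return None if that is not possible."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device (or PyTorch) found")
    else:
        print(f"compute capability {cap[0]}.{cap[1]}, FA2 threshold met: {supports_fa2(*cap)}")
```

This only mirrors the check described in the error message; it does not change which backends vLLM will accept for the DFlash draft model.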
