Will there be a small model for speculative decoding?

#71
by Regrin - opened

Hello!
I'm very interested in inference speed. The easiest way to improve it without sacrificing quality is speculative decoding.
I have a slow computer and would like to try it, but the problem is that there's no small draft model for Gemma 4.
Could you perhaps train a tiny draft model for speculative decoding?
That would be really, really helpful.

Hello! I'm not sure about Google's plans, but we have trained a model for use in vLLM, which is available here: https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.eagle3. We intend to keep iterating over the next week. Feel free to try it out!
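For anyone who wants to try it, a minimal sketch of serving the model with the EAGLE-3 speculator in vLLM might look like the following. The base model ID (`google/gemma-4-31b-it`) is an assumption, and the exact flag name and speculative-config schema vary between vLLM releases, so check `vllm serve --help` for your installed version:

```shell
# Hypothetical sketch: launch an OpenAI-compatible vLLM server with the
# EAGLE-3 speculator as the draft model. The base model ID below is an
# assumption; the speculator ID comes from the link above.
vllm serve google/gemma-4-31b-it \
    --speculative-config '{
        "method": "eagle3",
        "model": "RedHatAI/gemma-4-31B-it-speculator.eagle3",
        "num_speculative_tokens": 3
    }'
```

`num_speculative_tokens` controls how many draft tokens are proposed per step; small values (2-4) are a common starting point, since overly long drafts get rejected more often and can erase the speedup.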

This is amazing if it works. Do you know whether it's supported with pipeline parallelism in vLLM?
