I could reproduce the error.
What’s going on (two separate failure modes)
A) LLaVA “image tokens vs image features” mismatch (ValueError)
In transformers’ LLaVA implementation, the image is not “implicitly in the cache.” Instead:
- Your prompt must contain `image_token_id` placeholders.
- During the forward pass, LLaVA computes image features from `pixel_values`.
- It then replaces the placeholder token embeddings with those image features.
- It hard-checks that the placeholder-token count matches the feature count and throws if not.
You can see the check in get_placeholder_mask() (this is exactly the error you hit): it compares the number of <image> placeholder slots against the flattened image feature length and raises ValueError if they differ. (GitHub)
Also, LLaVA commonly drops the CLIP CLS token when vision_feature_select_strategy == "default", so the expected number of visual “patch tokens” is typically (H/patch)*(W/patch) (plus any “additional image tokens”). (GitHub)
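As a sanity check, you can compare the two counts yourself. A sketch assuming a llava-1.5-style config (CLIP vision tower, `"default"` select strategy) and `inputs` produced by the processor:

```python
# Sanity check: placeholder tokens produced by the processor vs. features the
# vision tower will emit. Assumes a llava-1.5-style config; "inputs" is the
# processor output for your prompt + image.
patch_size = model.config.vision_config.patch_size      # e.g. 14
image_size = model.config.vision_config.image_size      # e.g. 336
expected_per_image = (image_size // patch_size) ** 2    # CLS dropped under "default" -> 576

image_token_id = model.config.image_token_id
n_placeholders = (inputs["input_ids"] == image_token_id).sum().item()
print(n_placeholders, expected_per_image)               # must match, or forward() raises ValueError
```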
Implication for cache steering: the KV entries you want to steer correspond to those placeholder positions after the merge, so you must correctly build the prompt/processor so that the placeholder positions exist and match what the vision tower produces.
B) cache_position becomes empty → IndexError: cache_position[-1] … size 0
This is a generation-side failure: when generate() is called with past_key_values, some builds/paths can end up with an empty cache_position tensor, which later crashes at cache_position[-1].
There are two relevant references:
- HF docs: `cache_position` must always be valid and advances by 1 per token; e.g. if the cache has 10 tokens, the next token must use `torch.tensor([10])`. (Hugging Face)
- A reported bug: “cache position incorrectly inferred for generation” when `past_key_values` is provided, leading to exactly your `IndexError`. (GitHub)

(There are also forum reports of the same `IndexError` pattern. (Hugging Face Forums))
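In code, the invariant looks roughly like this (a sketch; `cache` stands for any `Cache` instance and `seed_ids` for the tokens you are about to feed):

```python
import torch

# The invariant: cache_position continues exactly where the cache ends.
past_len = cache.get_seq_length()                 # N tokens already in the cache
k = seed_ids.shape[1]                             # K new tokens being fed now
cache_position = torch.arange(past_len, past_len + k, device=seed_ids.device)
# e.g. 10 cached tokens + 1 new token -> cache_position == tensor([10])
```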
Implication: even if you pass a non-empty cache_position, you can still hit this if generate() internally overwrites/infers it incorrectly in some paths. The most robust workaround is: don’t rely on generate() for the “resume from modified cache” step—use a small manual decode loop (greedy or sampling) where you control cache_position.
Answers to your 3 questions
1) Correct way to modify the KV cache between prefill and generation (Transformers v5 / Cache objects)
Key points
- In Transformers v5, LLaVA returns `past_key_values` as a `Cache` instance (not the old tuple-of-tuples). (GitHub)
- The cache stores tensors shaped `[batch, num_heads, seq_len, head_dim]`. (Hugging Face)
- The cache is structured as layers; the docs show `cache.layers[idx].keys` / `cache.layers[idx].values`. (Hugging Face)
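A quick inspection sketch, assuming the v5 layered layout described in the docs (`prefill_inputs` here stands for your processor output):

```python
# Inspect the prefill cache (v5-style layered Cache; older versions expose
# cache.key_cache[idx] / cache.value_cache[idx] instead).
out = model(**prefill_inputs, use_cache=True, return_dict=True)
cache = out.past_key_values                       # a Cache instance, not a tuple-of-tuples

k0 = cache.layers[0].keys                         # [batch, num_heads, seq_len, head_dim]
v0 = cache.layers[0].values
print(k0.shape, v0.shape, cache.get_seq_length())
```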
Recommended cache-edit pattern (in-place)
Prefill → find visual positions → edit those positions in cache.layers[l].keys/values → resume decoding.
Important: avoid converting to legacy format unless you must. Conversions are a common place to lose metadata / end up with wrong “seen token” bookkeeping, which increases the chance generate() later infers cache_position incorrectly.
Practical steering index logic:
- Get the token id used for images: `image_token_id = model.config.image_token_id` (LLaVA uses it in the forward merge). (GitHub)
- Find visual token positions from the prefill `input_ids` (the sequence whose KV you actually cached):
  `visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)`

Then edit (per layer):

- `cache.layers[layer].keys[:, :, visual_pos, :] += delta_k`
- `cache.layers[layer].values[:, :, visual_pos, :] += delta_v`
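Putting that together, a hedged in-place edit sketch (assumes the v5 `cache.layers[...]` layout and that `delta_k` / `delta_v` are steering tensors broadcastable over the selected slots):

```python
import torch

# In-place steering: add deltas to the visual-token slots of every layer's KV cache.
image_token_id = model.config.image_token_id
visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)

with torch.no_grad():
    for layer in range(len(cache.layers)):
        k = cache.layers[layer].keys              # [batch, num_heads, seq_len, head_dim]
        v = cache.layers[layer].values
        # delta_k / delta_v: your steering tensors, broadcastable over
        # [batch, num_heads, len(visual_pos), head_dim]
        k[:, :, visual_pos, :] += delta_k.to(dtype=k.dtype, device=k.device)
        v[:, :, visual_pos, :] += delta_v.to(dtype=v.dtype, device=v.device)
```

Keeping the edit inside `torch.no_grad()` and in place avoids rebuilding the cache object, so the “seen token” bookkeeping stays intact.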
Two critical alignment rules
Rule A (prompt split): If you will “resume” with seed_ids = last_token, then the cache must contain everything before that last token.
So prefill on input_ids[:, :-1], not the full prompt.
Rule B (visual positions): Compute positions against the sequence you actually cached (input_ids[:, :-1]), and if needed clamp to < prefill_len.
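A minimal sketch of both rules together (variable names follow the snippet above):

```python
# Rule A: cache everything except the last prompt token.
prefill_input_ids = input_ids[:, :-1]
prefill_len = prefill_input_ids.shape[1]

# Rule B: locate visual positions in the sequence you actually cached, and clamp.
visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)
visual_pos = visual_pos[visual_pos < prefill_len]  # defensive; should already hold
```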
2) Do you need to pass pixel_values to generate() if image tokens are already in the cache?
If your prefill already ran with pixel_values and you are resuming from that cache:
- Do not pass `pixel_values` again for the resume step.
- Also do not include fresh `<image>` placeholders in the resume `input_ids` (your resume token should be normal text).
Why: LlavaForConditionalGeneration.prepare_inputs_for_generation() explicitly forwards pixel_values only on the first iteration, and notes that “first iteration” can mean “continue generate from cache.” (GitHub)
If you pass pixel_values on a resume call where your new input_ids contain 0 image tokens, you can trigger the mismatch check (“tokens: 0, features: …”). (GitHub)
So: prefill step uses pixel_values; resume step should not.
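In call form (a sketch; `input_ids`, `attention_mask`, and `pixel_values` are assumed to come from the processor as in the prefill step):

```python
import torch

# Prefill: pixel_values are consumed here, matched against the <image> placeholders.
out = model(input_ids=input_ids[:, :-1], attention_mask=attention_mask[:, :-1],
            pixel_values=pixel_values, use_cache=True, return_dict=True)

# Resume: one plain text token, no pixel_values. Passing pixel_values here, with
# zero placeholders in the new input_ids, is what trips "tokens: 0, features: ...".
out = model(input_ids=input_ids[:, -1:], attention_mask=attention_mask,
            past_key_values=out.past_key_values, use_cache=True, return_dict=True,
            cache_position=torch.tensor([input_ids.shape[1] - 1], device=input_ids.device))
```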
3) How to resolve the cache_position IndexError (model thinks cache is empty / misaligned)
What cache_position must be
Docs specify:
- If the cache already has `N` tokens and you are now processing `K` new tokens, `cache_position` must be `[N, ..., N+K-1]`. (Hugging Face)
- The attention mask length must match `past_kv_length + new_tokens_length`. (Hugging Face)
The “most correct” resume formula for your setup
If you prefill on input_ids[:, :-1]:
- `prefill_len = input_ids.shape[1] - 1`
- `seed_ids = input_ids[:, -1:]`
- `cache_position = torch.tensor([prefill_len], device=...)`
- `attention_mask` should have length `prefill_len + 1` (typically just the original attention_mask, unchanged, if there is no padding)
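As a compact pre-flight check (a sketch; `cache` is the prefill cache from above):

```python
import torch

prefill_len = input_ids.shape[1] - 1
seed_ids = input_ids[:, -1:]
cache_position = torch.tensor([prefill_len], device=input_ids.device)

# Sanity checks before resuming:
assert cache.get_seq_length() == prefill_len        # cache holds the prompt minus its last token
assert attention_mask.shape[1] == prefill_len + 1   # past tokens + the one seed token
```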
Why you still may crash with generate()
There is a known report that generate() can infer/overwrite cache_position incorrectly when past_key_values is passed, leading to empty cache_position and then the same IndexError you saw. (GitHub)
This is consistent with your trace.
Robust fix: avoid generate() for the resume step
Implement a small decode loop using model() directly (the HF caching guide even shows a manual loop and emphasizes correct attention_mask + cache_position handling). (Hugging Face)
This also makes KV steering easier (you control exactly when cache is read/written).
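A minimal greedy version of that loop (a sketch under the assumptions above: prefill was done on `input_ids[:, :-1]`, the cache has already been edited, and there is no left padding):

```python
import torch

max_new_tokens = 64
prefill_len = input_ids.shape[1] - 1
cur_ids = input_ids[:, -1:]                         # the held-back last prompt token
attn = attention_mask[:, :-1]                       # mask covering the cached prefix
generated = []

for step in range(max_new_tokens):
    attn = torch.cat([attn, torch.ones_like(cur_ids)], dim=-1)   # grow mask by one slot
    with torch.no_grad():
        out = model(input_ids=cur_ids,
                    attention_mask=attn,
                    past_key_values=cache,
                    cache_position=torch.tensor([prefill_len + step], device=cur_ids.device),
                    use_cache=True,
                    return_dict=True)
    cur_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    generated.append(cur_ids)
    if cur_ids.item() == processor.tokenizer.eos_token_id:
        break

print(processor.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```

Note how the attention mask always covers `past_kv_length + new_tokens_length` and `cache_position` advances by exactly one per step, which is the invariant from the docs above.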
Minimal “known-good” pipeline for LLaVA visual KV steering
1. Build inputs correctly
   - Prompt must include `<image>` placeholders.
   - Ensure the processor is configured so it expands placeholders consistently with the vision tower settings (patch size / strategy). The LLaVA docs explicitly expose these knobs on the processor. (Hugging Face)
   - Avoid truncation that could drop image placeholders.
2. Prefill (cache prompt minus last token)
   - Run `model(input_ids[:, :-1], pixel_values=..., attention_mask=attention_mask[:, :-1], use_cache=True, return_dict=True)`.
3. Locate visual token positions
   - `visual_pos = where(prefill_input_ids == image_token_id)`
   - Verify `len(visual_pos)` equals what the model expects for your resolution/patching.
4. Edit cache in place
   - Add steering deltas at `seq_len` indices = `visual_pos` for each layer.
5. Resume decoding with a manual loop
   - Start from `seed_ids = input_ids[:, -1:]`.
   - Use `cache_position = tensor([prefill_len])`, increment by 1 each step.
   - Do not pass `pixel_values`.

A full sketch of these five steps follows.
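Hedged end-to-end version (the checkpoint name, prompt format, image path, and steering deltas are placeholders to adapt; the `cache.layers[...]` access assumes the v5 layered cache layout):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"                     # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# 1) Build inputs: prompt with an <image> placeholder, expanded by the processor.
image = Image.open("your_image.jpg")                      # placeholder path
prompt = "USER: <image>\nDescribe the image. ASSISTANT:"  # llava-1.5 style prompt (assumed)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
input_ids, attention_mask = inputs["input_ids"], inputs["attention_mask"]

# 2) Prefill on everything except the last prompt token, with pixel_values.
with torch.no_grad():
    out = model(input_ids=input_ids[:, :-1],
                attention_mask=attention_mask[:, :-1],
                pixel_values=inputs["pixel_values"],
                use_cache=True, return_dict=True)
cache = out.past_key_values
prefill_len = input_ids.shape[1] - 1

# 3) Locate visual token positions in the cached sequence.
image_token_id = model.config.image_token_id
visual_pos = (input_ids[0, :-1] == image_token_id).nonzero().squeeze(-1)

# 4) Edit the cache in place (stand-in deltas; replace with your steering tensors).
delta_k = delta_v = 0.0
with torch.no_grad():
    for layer in range(len(cache.layers)):                # v5 layered-cache layout (assumed)
        cache.layers[layer].keys[:, :, visual_pos, :] += delta_k
        cache.layers[layer].values[:, :, visual_pos, :] += delta_v

# 5) Resume with a manual greedy loop: no pixel_values, explicit cache_position.
cur_ids, attn, new_tokens = input_ids[:, -1:], attention_mask[:, :-1], []
for step in range(64):
    attn = torch.cat([attn, torch.ones_like(cur_ids)], dim=-1)
    with torch.no_grad():
        out = model(input_ids=cur_ids, attention_mask=attn, past_key_values=cache,
                    cache_position=torch.tensor([prefill_len + step], device=cur_ids.device),
                    use_cache=True, return_dict=True)
    cur_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    new_tokens.append(cur_ids)
    if cur_ids.item() == processor.tokenizer.eos_token_id:
        break

print(processor.decode(torch.cat(new_tokens, dim=-1)[0], skip_special_tokens=True))
```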
Similar cases / references worth reading
- Transformers caching guide: explains `cache_position`, cache tensor shapes, layer storage, and a manual decode loop. (Hugging Face)
- LLaVA forward + generation code: shows the placeholder-token vs image-feature check and the “pixel_values only on first iteration” rule. (GitHub)
- Bug report on empty `cache_position` during `generate(past_key_values=...)` leading to `IndexError`. (GitHub)
- HF forum thread with the same `IndexError` signature. (Hugging Face Forums)