I could reproduce the error.
What’s going on (two separate failure modes)
A) LLaVA “image tokens vs image features” mismatch (ValueError)
In transformers’ LLaVA implementation, the image is not “implicitly in the cache.” Instead:
- Your prompt must contain `image_token_id` placeholders.
- During the forward pass, LLaVA computes image features from `pixel_values`.
- It then replaces the placeholder token embeddings with those image features.
- It hard-checks that the placeholder-token count matches the feature count and throws if not.
You can see the check in get_placeholder_mask() (this is exactly the error you hit): it compares the number of <image> placeholder slots against the flattened image feature length and raises ValueError if they differ. (GitHub)
Also, LLaVA commonly drops the CLIP CLS token when vision_feature_select_strategy == "default", so the expected number of visual “patch tokens” is typically (H/patch)*(W/patch) (plus any “additional image tokens”). (GitHub)
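As a sanity check, you can compare the two counts yourself. A sketch assuming a llava-1.5-style config (CLIP vision tower, `"default"` select strategy) and `inputs` produced by the processor:

```python
# Sanity check: placeholder tokens produced by the processor vs. features the
# vision tower will emit. Assumes a llava-1.5-style config; "inputs" is the
# processor output for your prompt + image.
patch_size = model.config.vision_config.patch_size      # e.g. 14
image_size = model.config.vision_config.image_size      # e.g. 336
expected_per_image = (image_size // patch_size) ** 2    # CLS dropped under "default" -> 576

image_token_id = model.config.image_token_id
n_placeholders = (inputs["input_ids"] == image_token_id).sum().item()
print(n_placeholders, expected_per_image)               # must match, or forward() raises ValueError
```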
Implication for cache steering: the KV entries you want to steer correspond to those placeholder positions after the merge, so you must correctly build the prompt/processor so that the placeholder positions exist and match what the vision tower produces.
B) cache_position becomes empty → IndexError: cache_position[-1] … size 0
This is a generation-side failure: when generate() is called with past_key_values, some builds/paths can end up with an empty cache_position tensor, which later crashes at cache_position[-1].
There are two relevant references:
- HF docs: `cache_position` must always be valid and advances by 1 per token; e.g. if the cache has 10 tokens, the next token must use `torch.tensor([10])`. (Hugging Face)
- A reported bug: “cache position incorrectly inferred for generation” when `past_key_values` is provided, leading to exactly your `IndexError`. (GitHub)

(There are also forum reports of the same `IndexError` pattern. (Hugging Face Forums))
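In code, the invariant looks roughly like this (a sketch; `cache` stands for any `Cache` instance and `seed_ids` for the tokens you are about to feed):

```python
import torch

# The invariant: cache_position continues exactly where the cache ends.
past_len = cache.get_seq_length()                 # N tokens already in the cache
k = seed_ids.shape[1]                             # K new tokens being fed now
cache_position = torch.arange(past_len, past_len + k, device=seed_ids.device)
# e.g. 10 cached tokens + 1 new token -> cache_position == tensor([10])
```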
Implication: even if you pass a non-empty cache_position, you can still hit this if generate() internally overwrites/infers it incorrectly in some paths. The most robust workaround is: don’t rely on generate() for the “resume from modified cache” step—use a small manual decode loop (greedy or sampling) where you control cache_position.
Answers to your 3 questions
1) Correct way to modify the KV cache between prefill and generation (Transformers v5 / Cache objects)
Key points
- In Transformers v5, LLaVA returns `past_key_values` as a `Cache` instance (not the old tuple-of-tuples). (GitHub)
- The cache stores tensors shaped `[batch, num_heads, seq_len, head_dim]`. (Hugging Face)
- The cache is structured as layers; the docs show `cache.layers[idx].keys` / `cache.layers[idx].values`. (Hugging Face)
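A quick inspection sketch, assuming the v5 layered layout described in the docs (`prefill_inputs` here stands for your processor output):

```python
# Inspect the prefill cache (v5-style layered Cache; older versions expose
# cache.key_cache[idx] / cache.value_cache[idx] instead).
out = model(**prefill_inputs, use_cache=True, return_dict=True)
cache = out.past_key_values                       # a Cache instance, not a tuple-of-tuples

k0 = cache.layers[0].keys                         # [batch, num_heads, seq_len, head_dim]
v0 = cache.layers[0].values
print(k0.shape, v0.shape, cache.get_seq_length())
```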
Recommended cache-edit pattern (in-place)
Prefill → find visual positions → edit those positions in cache.layers[l].keys/values → resume decoding.
Important: avoid converting to legacy format unless you must. Conversions are a common place to lose metadata / end up with wrong “seen token” bookkeeping, which increases the chance generate() later infers cache_position incorrectly.
Practical steering index logic:
- Get the token id used for images: `image_token_id = model.config.image_token_id` (LLaVA uses it in the forward merge). (GitHub)
- Find visual token positions from the prefill `input_ids` (the sequence whose KV you actually cached):
  `visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)`

Then edit (per layer):

- `cache.layers[layer].keys[:, :, visual_pos, :] += delta_k`
- `cache.layers[layer].values[:, :, visual_pos, :] += delta_v`
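Putting that together, a hedged in-place edit sketch (assumes the v5 `cache.layers[...]` layout and that `delta_k` / `delta_v` are steering tensors broadcastable over the selected slots):

```python
import torch

# In-place steering: add deltas to the visual-token slots of every layer's KV cache.
image_token_id = model.config.image_token_id
visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)

with torch.no_grad():
    for layer in range(len(cache.layers)):
        k = cache.layers[layer].keys              # [batch, num_heads, seq_len, head_dim]
        v = cache.layers[layer].values
        # delta_k / delta_v: your steering tensors, broadcastable over
        # [batch, num_heads, len(visual_pos), head_dim]
        k[:, :, visual_pos, :] += delta_k.to(dtype=k.dtype, device=k.device)
        v[:, :, visual_pos, :] += delta_v.to(dtype=v.dtype, device=v.device)
```

Keeping the edit inside `torch.no_grad()` and in place avoids rebuilding the cache object, so the “seen token” bookkeeping stays intact.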
Two critical alignment rules
Rule A (prompt split): If you will “resume” with seed_ids = last_token, then the cache must contain everything before that last token.
So prefill on input_ids[:, :-1], not the full prompt.
Rule B (visual positions): Compute positions against the sequence you actually cached (input_ids[:, :-1]), and if needed clamp to < prefill_len.
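A minimal sketch of both rules together (variable names follow the snippet above):

```python
# Rule A: cache everything except the last prompt token.
prefill_input_ids = input_ids[:, :-1]
prefill_len = prefill_input_ids.shape[1]

# Rule B: locate visual positions in the sequence you actually cached, and clamp.
visual_pos = (prefill_input_ids[0] == image_token_id).nonzero().squeeze(-1)
visual_pos = visual_pos[visual_pos < prefill_len]  # defensive; should already hold
```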
2) Do you need to pass pixel_values to generate() if image tokens are already in the cache?
If your prefill already ran with pixel_values and you are resuming from that cache:
- Do not pass `pixel_values` again for the resume step.
- Also do not include fresh `<image>` placeholders in the resume `input_ids` (your resume token should be normal text).
Why: LlavaForConditionalGeneration.prepare_inputs_for_generation() explicitly forwards pixel_values only on the first iteration, and notes that “first iteration” can mean “continue generate from cache.” (GitHub)
If you pass pixel_values on a resume call where your new input_ids contain 0 image tokens, you can trigger the mismatch check (“tokens: 0, features: …”). (GitHub)
So: prefill step uses pixel_values; resume step should not.
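In call form (a sketch; `input_ids`, `attention_mask`, and `pixel_values` are assumed to come from the processor as in the prefill step):

```python
import torch

# Prefill: pixel_values are consumed here, matched against the <image> placeholders.
out = model(input_ids=input_ids[:, :-1], attention_mask=attention_mask[:, :-1],
            pixel_values=pixel_values, use_cache=True, return_dict=True)

# Resume: one plain text token, no pixel_values. Passing pixel_values here, with
# zero placeholders in the new input_ids, is what trips "tokens: 0, features: ...".
out = model(input_ids=input_ids[:, -1:], attention_mask=attention_mask,
            past_key_values=out.past_key_values, use_cache=True, return_dict=True,
            cache_position=torch.tensor([input_ids.shape[1] - 1], device=input_ids.device))
```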
3) How to resolve the cache_position IndexError (model thinks cache is empty / misaligned)
What cache_position must be
Docs specify:
- If the cache already has `N` tokens and you are now processing `K` new tokens, `cache_position` must be `[N, ..., N+K-1]`. (Hugging Face)
- The attention mask length must match `past_kv_length + new_tokens_length`. (Hugging Face)
The “most correct” resume formula for your setup
If you prefill on input_ids[:, :-1]:
- `prefill_len = input_ids.shape[1] - 1`
- `seed_ids = input_ids[:, -1:]`
- `cache_position = torch.tensor([prefill_len], device=...)`
- `attention_mask` should have length `prefill_len + 1` (typically just the original attention_mask, unchanged, if there is no padding)
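As a compact pre-flight check (a sketch; `cache` is the prefill cache from above):

```python
import torch

prefill_len = input_ids.shape[1] - 1
seed_ids = input_ids[:, -1:]
cache_position = torch.tensor([prefill_len], device=input_ids.device)

# Sanity checks before resuming:
assert cache.get_seq_length() == prefill_len        # cache holds the prompt minus its last token
assert attention_mask.shape[1] == prefill_len + 1   # past tokens + the one seed token
```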
Why you still may crash with generate()
There is a known report that generate() can infer/overwrite cache_position incorrectly when past_key_values is passed, leading to empty cache_position and then the same IndexError you saw. (GitHub)
This is consistent with your trace.
Robust fix: avoid generate() for the resume step
Implement a small decode loop using model() directly (the HF caching guide even shows a manual loop and emphasizes correct attention_mask + cache_position handling). (Hugging Face)
This also makes KV steering easier (you control exactly when cache is read/written).
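A minimal greedy version of that loop (a sketch under the assumptions above: prefill was done on `input_ids[:, :-1]`, the cache has already been edited, and there is no left padding):

```python
import torch

max_new_tokens = 64
prefill_len = input_ids.shape[1] - 1
cur_ids = input_ids[:, -1:]                         # the held-back last prompt token
attn = attention_mask[:, :-1]                       # mask covering the cached prefix
generated = []

for step in range(max_new_tokens):
    attn = torch.cat([attn, torch.ones_like(cur_ids)], dim=-1)   # grow mask by one slot
    with torch.no_grad():
        out = model(input_ids=cur_ids,
                    attention_mask=attn,
                    past_key_values=cache,
                    cache_position=torch.tensor([prefill_len + step], device=cur_ids.device),
                    use_cache=True,
                    return_dict=True)
    cur_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    generated.append(cur_ids)
    if cur_ids.item() == processor.tokenizer.eos_token_id:
        break

print(processor.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```

Note how the attention mask always covers `past_kv_length + new_tokens_length` and `cache_position` advances by exactly one per step, which is the invariant from the docs above.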
Minimal “known-good” pipeline for LLaVA visual KV steering
1. Build inputs correctly
   - Prompt must include `<image>` placeholders.
   - Ensure the processor is configured so it expands placeholders consistently with the vision tower settings (patch size / strategy). The LLaVA docs explicitly expose these knobs on the processor. (Hugging Face)
   - Avoid truncation that could drop image placeholders.
2. Prefill (cache prompt minus last token)
   - Run `model(input_ids[:, :-1], pixel_values=..., attention_mask=attention_mask[:, :-1], use_cache=True, return_dict=True)`.
3. Locate visual token positions
   - `visual_pos = where(prefill_input_ids == image_token_id)`
   - Verify `len(visual_pos)` equals what the model expects for your resolution/patching.
4. Edit cache in place
   - Add steering deltas at `seq_len` indices = `visual_pos` for each layer.
5. Resume decoding with a manual loop
   - Start from `seed_ids = input_ids[:, -1:]`.
   - Use `cache_position = tensor([prefill_len])`, increment by 1 each step.
   - Do not pass `pixel_values`.

A full sketch of these five steps follows.
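Hedged end-to-end version (the checkpoint name, prompt format, image path, and steering deltas are placeholders to adapt; the `cache.layers[...]` access assumes the v5 layered cache layout):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"                     # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# 1) Build inputs: prompt with an <image> placeholder, expanded by the processor.
image = Image.open("your_image.jpg")                      # placeholder path
prompt = "USER: <image>\nDescribe the image. ASSISTANT:"  # llava-1.5 style prompt (assumed)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
input_ids, attention_mask = inputs["input_ids"], inputs["attention_mask"]

# 2) Prefill on everything except the last prompt token, with pixel_values.
with torch.no_grad():
    out = model(input_ids=input_ids[:, :-1],
                attention_mask=attention_mask[:, :-1],
                pixel_values=inputs["pixel_values"],
                use_cache=True, return_dict=True)
cache = out.past_key_values
prefill_len = input_ids.shape[1] - 1

# 3) Locate visual token positions in the cached sequence.
image_token_id = model.config.image_token_id
visual_pos = (input_ids[0, :-1] == image_token_id).nonzero().squeeze(-1)

# 4) Edit the cache in place (stand-in deltas; replace with your steering tensors).
delta_k = delta_v = 0.0
with torch.no_grad():
    for layer in range(len(cache.layers)):                # v5 layered-cache layout (assumed)
        cache.layers[layer].keys[:, :, visual_pos, :] += delta_k
        cache.layers[layer].values[:, :, visual_pos, :] += delta_v

# 5) Resume with a manual greedy loop: no pixel_values, explicit cache_position.
cur_ids, attn, new_tokens = input_ids[:, -1:], attention_mask[:, :-1], []
for step in range(64):
    attn = torch.cat([attn, torch.ones_like(cur_ids)], dim=-1)
    with torch.no_grad():
        out = model(input_ids=cur_ids, attention_mask=attn, past_key_values=cache,
                    cache_position=torch.tensor([prefill_len + step], device=cur_ids.device),
                    use_cache=True, return_dict=True)
    cur_ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    new_tokens.append(cur_ids)
    if cur_ids.item() == processor.tokenizer.eos_token_id:
        break

print(processor.decode(torch.cat(new_tokens, dim=-1)[0], skip_special_tokens=True))
```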
Similar cases / references worth reading
- Transformers caching guide: explains `cache_position`, cache tensor shapes, layer storage, and a manual decode loop. (Hugging Face)
- LLaVA forward + generation code: shows the placeholder-token vs image-feature check and the “pixel_values only on first iteration” rule. (GitHub)
- Bug report on empty `cache_position` during `generate(past_key_values=...)` leading to `IndexError`. (GitHub)
- HF forum thread with the same `IndexError` signature. (Hugging Face Forums)