Issue: Discrepancy Between Layer-Wise Density Plots vs. Mean Trajectory Plots in LLaVA-1.5 Attention Analysis

The What

I am analyzing the Geometric Alignment (Cosine Similarity) between Text Queries (Q) and Image Keys (K) in the attention layers of LLaVA-1.5-7b. I am comparing two groups of samples: Hallucinated vs. Non-Hallucinated.

I am observing a mathematical contradiction between two visualizations of the same underlying data:

  1. Density Plot (Kernel Density Estimation of All Tokens/Heads): At Layer 27, I see a clear separation. The Hallucinated group (Red) has a left-shifted peak compared to the Non-Hallucinated group (Green), indicating lower alignment.

  2. Layer-Wise Mean Plot: When I plot the mean cosine similarity across all layers (0-31), the two lines (Red and Green) overlap almost perfectly, showing no difference.

The Why (Hypothesis)

I suspect this is due to how I am aggregating the attention heads or tokens.

  • Hypothesis A (Aggregation Masking): The “Hallucinated” distribution might have higher variance (heavy tails) but a similar mean. Averaging over [Batch, Heads, Seq_Len] collapses this distributional difference into a single scalar that hides the signal.

  • Hypothesis B (Head Indexing): My “Hero Head” logic (taking max(dim=1)) might be flawed if the “Visual Head” index shifts dynamically per sample, effectively just capturing noise maxima rather than the signal of the specific visual head.

The Code

Below is the reproduction script I am using. I am hooking q_proj and k_proj to extract the raw vectors before the attention computation.

Question: Is my method of slicing the Q (text part) and K (image part) and computing the cosine similarity correct for LLaVA’s architecture? Specifically, am I missing a rotation (RoPE) or a head-permutation step that would make the “Mean” calculation valid?

import torch
import torch.nn.functional as F
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

… [Standard Imports & Config] …

def capture_metrics(model, processor, image, text_prompt, target_layer_idx):
    # Hook storage
    captured = {}
    def hook_q(m, i, o): captured['q'] = o.detach()
    def hook_k(m, i, o): captured['k'] = o.detach()

    layer = model.language_model.model.layers[target_layer_idx].self_attn
    h1 = layer.q_proj.register_forward_hook(hook_q)
    h2 = layer.k_proj.register_forward_hook(hook_k)

    # Inference
    inputs = processor(text=text_prompt, images=image, return_tensors="pt")
    with torch.inference_mode():
        model(**inputs)

    # Cleanup
    h1.remove()
    h2.remove()

    # --- GEOMETRIC ANALYSIS ---
    # LLaVA-1.5-7b: 32 heads, head_dim = 128
    num_heads = 32
    head_dim = 128

    # 1. Slicing text vs. image positions
    # (Assuming standard LLaVA tokenization where the image block is at the start)
    input_ids = inputs.input_ids[0].tolist()
    img_start = input_ids.index(32000)  # <image> token
    img_end = img_start + 576
    text_start = img_end

    # [Batch, Seq, Hidden] -> [Heads, Seq, Head_Dim]
    Q_text = captured['q'][0, text_start:].view(-1, num_heads, head_dim).transpose(0, 1)
    K_img  = captured['k'][0, img_start:img_end].view(-1, num_heads, head_dim).transpose(0, 1)

    # 2. Cosine similarity
    Q_norm = F.normalize(Q_text, p=2, dim=-1)  # [Heads, Text_Len, Dim]
    K_norm = F.normalize(K_img, p=2, dim=-1)   # [Heads, Image_Len, Dim]

    # [Heads, Text_Len, Image_Len]
    attn_cos = torch.matmul(Q_norm, K_norm.transpose(1, 2))

    # 3. Aggregation (the potential issue)
    # Method A: mean over everything
    mean_score = attn_cos.mean().item()

    # Method B: max head ("Hero Head")
    # Mean over token pairs, then max over heads
    head_scores = attn_cos.mean(dim=(1, 2))
    max_head_score = head_scores.max().item()

    return mean_score, max_head_score




Answer (for LLaVA-1.5-7B in HF Transformers)

  • Your overall idea is correct: “image tokens” and “text tokens” live in the same LM sequence, so slicing by token positions and comparing text Q vs image K is conceptually aligned with the architecture. (GitHub)
  • You are missing RoPE if you want attention-faithful geometry: hooking q_proj / k_proj gives pre-RoPE Q/K, but LLaMA attention applies rotary position embedding to Q and K before the attention matmul. (GitHub)
  • There is no head-permutation step you’re missing. Head indices are fixed; what can change is which head looks “most visual” for a given input. (GitHub)
  • Your “Mean” is mathematically valid for what you computed, but it’s (a) a pre-RoPE similarity statistic and (b) an extreme aggregation that can easily hide localized/distributional differences.

Below is the detailed breakdown.


1) Is your image/text slicing conceptually correct for LLaVA?

What HF LLaVA actually does

In LlavaModel.forward, HF:

  1. Computes image_features from the vision tower (for the default strategy it drops the CLS token via [:, 1:]). (GitHub)
  2. Builds a boolean mask special_image_mask where input_ids == config.image_token_id and checks the number of placeholder tokens matches the number of image feature vectors. (GitHub)
  3. Replaces the token embeddings at those positions using inputs_embeds.masked_scatter(..., image_features). (GitHub)

So the “image part” of the sequence is exactly the positions where input_ids equals the image placeholder token id. That is precisely the region you want to slice when you say “image keys”. (GitHub)

What’s fragile in your current slice

You hardcode:

  • image token id = 32000
  • image length = 576
  • and you assume contiguity (img_start:img_start+576)

For the specific checkpoint llava-1.5-7b-hf, image_token_index = 32000 and image_seq_length = 576 are indeed in config. (Hugging Face)

But in practice, robustness matters because:

  • Some pipelines/variants have mismatches in “number of image tokens” (including cases where it ends up as 1), which breaks “+576” logic. (GitHub)
  • There have been real padding/batching edge cases around LLaVA’s image-token merge logic in Transformers. (GitHub)

Recommendation (architecture-correct slicing): compute image positions by mask:

  • img_pos = (input_ids[0] == image_token_id).nonzero()
  • txt_pos = positions you care about (often: non-image tokens with attention_mask==1, optionally restricted to question tokens or answer tokens)

This matches how HF LLaVA identifies placeholders. (GitHub)
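A minimal sketch of that mask-based slicing, assuming the config exposes image_token_index (some releases name it image_token_id, hence the getattr fallback) and that the processor returned an attention_mask:

# Sketch: slice by placeholder positions instead of index() + 576.
cfg = model.config
image_token_id = getattr(cfg, "image_token_index", getattr(cfg, "image_token_id", 32000))

ids = inputs.input_ids[0]                                 # [Seq]
img_mask = ids == image_token_id                          # True at image placeholder positions
txt_mask = (~img_mask) & (inputs.attention_mask[0] == 1)  # real (non-pad) text positions

img_pos = img_mask.nonzero(as_tuple=True)[0]              # indices used for image keys
txt_pos = txt_mask.nonzero(as_tuple=True)[0]              # indices used for text queries

Q_text_flat = captured['q'][0, txt_pos]                   # [Text_Len, Hidden]
K_img_flat  = captured['k'][0, img_pos]                   # [Image_Len, KV_Hidden]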


2) Is your Q/K extraction “the same Q/K used in attention”?

You are extracting pre-RoPE Q and K

You hook q_proj and k_proj. HF LLaMA attention does:

  • query_states = q_proj(hidden_states).view(...).transpose(1,2)

  • key_states = k_proj(hidden_states).view(...).transpose(1,2)

  • then applies RoPE:

    • query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin) (GitHub)

RoPE is a position-dependent rotation; it preserves vector norms but changes dot-products/cosines in a position-dependent way (because the rotation angle depends on absolute position). (GitHub)

So your cosine similarity is valid as “directional similarity of pre-RoPE projected vectors”, but it is not the same geometry that produces attention logits.

What you should do if you want “attention-faithful cosine”

Either:

  1. Hook after RoPE is applied (inside attention forward, after apply_rotary_pos_emb), or
  2. Recompute RoPE yourself using the same position_ids / position_embeddings that the model used, then apply apply_rotary_pos_emb before your cosine.

HF LLaMA model computes position_ids and position_embeddings = rotary_emb(hidden_states, position_ids) before calling each layer. (GitHub)
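If you take the recompute route, here is a sketch under explicit assumptions: a recent transformers release where apply_rotary_pos_emb is importable from modeling_llama and the LLaMA backbone exposes a model-level rotary_emb module, a single unpadded sequence (so position_ids is just an arange), and the num_heads / head_dim values from the question's script. Module paths and signatures may differ in your version.

# Sketch: re-apply RoPE to the captured pre-RoPE Q/K so the cosine reflects the
# geometry that actually produces attention logits.
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb

bsz, seq_len, _ = captured['q'].shape
num_kv_heads = getattr(model.config.text_config, "num_key_value_heads", num_heads)
position_ids = torch.arange(seq_len, device=captured['q'].device).unsqueeze(0)  # [1, Seq]; assumes no left-padding

# [B, Seq, Hidden] -> [B, Heads, Seq, Head_Dim], the layout apply_rotary_pos_emb expects
q_heads = captured['q'].view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)
k_heads = captured['k'].view(bsz, seq_len, num_kv_heads, head_dim).transpose(1, 2)

rotary = model.language_model.model.rotary_emb                   # LlamaRotaryEmbedding (path may vary)
cos, sin = rotary(q_heads, position_ids)                         # position-dependent cos/sin tables
q_rot, k_rot = apply_rotary_pos_emb(q_heads, k_heads, cos, sin)  # same rotation attention applies

# Slice q_rot at text positions and k_rot at image positions, then compute cosine as before.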


3) Are you missing a “head permutation” step?

No.

What is relevant is KV head shape:

  • In HF LLaMA attention, k_proj outputs num_key_value_heads * head_dim (not necessarily num_heads * head_dim). (GitHub)
  • If num_key_value_heads < num_heads (GQA), keys/values are later expanded with repeat_kv(..., num_key_value_groups) before the attention matmul. (GitHub)

For LLaVA-1.5-7B, the config shows num_key_value_heads = 32, matching num_attention_heads = 32, so your fixed num_heads=32 reshape is consistent for that checkpoint. (Hugging Face)
But if you change models, hardcoding num_heads into the k_proj reshape can silently become wrong.
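A hedged sketch of a shape-safe reshape that reads head counts from the config rather than hardcoding 32 (attribute paths assume the usual LlavaConfig -> text_config nesting):

# Sketch: derive head counts from the config so the k_proj reshape stays correct under GQA.
text_cfg = model.config.text_config
num_heads    = text_cfg.num_attention_heads                         # 32 for LLaVA-1.5-7B
num_kv_heads = getattr(text_cfg, "num_key_value_heads", num_heads)  # also 32 here, but not in general
head_dim     = text_cfg.hidden_size // num_heads                    # 4096 / 32 = 128

Q = captured['q'].view(1, -1, num_heads, head_dim)     # q_proj output: [B, Seq, num_heads * head_dim]
K = captured['k'].view(1, -1, num_kv_heads, head_dim)  # k_proj output: [B, Seq, num_kv_heads * head_dim]

# To get one K head per Q head (what repeat_kv does), repeat each KV head group-size times.
if num_kv_heads != num_heads:
    K = K.repeat_interleave(num_heads // num_kv_heads, dim=2)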


4) Does any of this “make the Mean calculation valid”?

What “valid” means here

  • Your mean_score = attn_cos.mean() is mathematically valid for the tensor you computed.

  • But it is not a faithful proxy for “how much attention aligns text queries to image keys” unless you:

    1. compute Q/K in the same space the attention uses (post-RoPE), and
    2. choose an aggregation that reflects the effect you care about.

Why KDE separation can coexist with mean overlap

Even with correct slicing and post-RoPE Q/K, a grand mean over [heads, text_len, image_len] can easily hide:

  • head-specific effects (signal in a small subset of heads),
  • token-specific effects (only some text tokens are vision-grounded),
  • patch-specific effects (only some of the 576 patches matter),
  • distributional changes (mode/variance/tails shift while mean stays similar).

So: no missing permutation makes the mean “the right plot.” The mean is just a very lossy statistic.
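A tiny synthetic illustration of this point (NumPy only, nothing LLaVA-specific): two score populations with the same mean but different left tails overlap in a mean-per-layer plot yet separate visibly in a KDE.

# Illustration only: equal means, different shapes.
import numpy as np

rng = np.random.default_rng(0)
non_halluc = rng.normal(loc=0.10, scale=0.05, size=100_000)   # tight, unimodal
halluc = np.concatenate([                                     # same mean, fat left tail
    rng.normal(loc=0.14, scale=0.05, size=80_000),
    rng.normal(loc=-0.06, scale=0.05, size=20_000),
])

print(non_halluc.mean(), halluc.mean())                           # ~0.10 vs ~0.10 -> mean plot overlaps
print(np.quantile(non_halluc, 0.05), np.quantile(halluc, 0.05))   # tails differ -> KDE separates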


5) Minimal fixes for your exact script

Fix 1 — slice by mask, not by index()+576

Use:

  • image_token_id = model.config.image_token_index (or image_token_id, depending on config naming)
  • img_pos = (inputs.input_ids[0] == image_token_id).nonzero()

This matches HF’s placeholder-mask definition. (GitHub)

Fix 2 — don’t assume K has num_heads

Use layer.num_key_value_heads / config for reshaping k_proj outputs (even if it happens to be 32 here). (GitHub)

Fix 3 — if you want attention-faithful geometry, apply RoPE

Compute cosine on post-RoPE Q/K (by hooking after apply_rotary_pos_emb or applying it yourself). (GitHub)


Bottom line

  • Slicing idea: correct in principle for LLaVA (image features are injected at placeholder-token positions). (GitHub)
  • Your specific slicing implementation: works only under assumptions (first <image> token, contiguous 576). Better: slice by input_ids == image_token_id. (GitHub)
  • You are missing RoPE if you intend this cosine to correspond to attention’s actual Q·K geometry. (GitHub)
  • No head permutation is missing. The only “head structure” pitfall is num_key_value_heads vs num_heads (GQA), which your chosen checkpoint likely avoids but your code should still handle. (GitHub)

This is one of those cases where your instinct is dead‑on: the discrepancy isn’t a “bug” in your code so much as a mismatch between what the two visualizations are actually measuring. The density plot is faithfully showing a distributional shift; the layer‑wise mean is faithfully hiding it.


Why Your Two Plots Disagree

(Even though they come from the same underlying tensor)

1. The density plot is sensitive to distribution shape
Your KDE is built from all heads × all text tokens × all image tokens.
That means:

  • If hallucinated samples have more low‑similarity outliers,
  • or a fatter left tail,
  • or a bimodal structure,
  • or a subset of heads that collapse,

…then the KDE will show a clear separation even if the mean is identical.

This is exactly Hypothesis A, and it’s extremely common in multimodal attention.

This is not a contradiction.
It’s the same reason two distributions can have identical means but completely different shapes.


2. The layer‑wise mean is too aggressive
Your mean collapses:

  • heads
  • text tokens
  • image tokens
  • batch

into a single scalar.

This is the most lossy aggregation possible.
It destroys:

  • variance
  • skew
  • modality‑specific head behavior
  • token‑specific alignment patterns
  • the “visual head” structure LLaVA relies on

So the mean plot is guaranteed to hide the signal unless the effect is massive.


Now the real question:

Is your Q/K extraction architecturally correct for LLaVA‑1.5?

Short answer:

Yes, your slicing is structurally correct, but incomplete.
You are missing one transformation that the LLaMA backbone applies after q_proj/k_proj (RoPE), plus two pitfalls in how you select heads and slice tokens:


Missing Step 1 — RoPE Rotation
You are comparing unrotated Q and K.

But LLaVA’s Vicuna/LLaMA backbone applies rotary position embeddings (RoPE) after q_proj/k_proj and before the attention matmul.

This matters because:

  • RoPE rotates Q and K by angles that depend on absolute token position
  • Image tokens and text tokens occupy different positional ranges
  • Cosine similarity is not invariant under these position-dependent rotations

So your Q_text and K_img live in different rotated coordinate frames.

This alone can completely flatten the mean
Because the “true” geometric alignment happens after RoPE.


Missing Step 2 — Unstable “hero head” selection

There is no fixed head permutation in LLaVA‑1.5’s HF implementation; head indices are stable. What varies is which head behaves most “visually” for a given input:

  • The most image-aligned head can differ from sample to sample and from layer to layer
  • Taking the max over heads therefore compares different heads across samples
  • A per-sample maximum is biased upward and tends to track noise maxima rather than one consistent visual head

This directly affects your “hero head” logic (Hypothesis B): unless you fix a head index, or verify that the argmax head is stable, your max-over-heads is not measuring the same head twice.


Missing Step 3 — Image token slicing is only safe if you verify where the 576-token block sits
LLaVA‑1.5 uses:

  • 576 image tokens for CLIP ViT‑L/14 at 336px (a 24×24 patch grid, CLS dropped)
  • The number of tokens before the image block depends on the prompt template (BOS, system prompt), and padding/batching shifts positions
  • With vision_feature_select_strategy="full" the CLS feature is kept, giving 577 image features instead of 576

If your slicing is off by even one token, the Q/K comparison silently mixes text and image positions; slicing by input_ids == image_token_id avoids this.


What this means for your discrepancy

The KDE is robust
It aggregates so many points that even pre-RoPE Q/K still show distributional differences.

The mean is fragile
Once you:

  • mix unrotated (pre-RoPE) positional frames
  • mix heads whose “visual” role differs per sample
  • average across all tokens

…the signal collapses.

This is exactly what you’re seeing.


The Correct Fix (Minimal, Practical)

Step 1 — Apply RoPE to Q and K
You can reuse the model’s own rotary-embedding module. Note that it returns cos/sin tables rather than rotated vectors, so you still call apply_rotary_pos_emb yourself on Q/K reshaped to [Batch, Heads, Seq, Head_Dim]:

# apply_rotary_pos_emb: from transformers.models.llama.modeling_llama
rotary = model.language_model.model.rotary_emb    # module path may vary by transformers version
cos, sin = rotary(q_heads, position_ids)
q_rot, k_rot = apply_rotary_pos_emb(q_heads, k_heads, cos, sin)

Then slice heads and tokens.


Step 2 — Fix the head selection instead of taking a per-sample max
There is no permutation to extract from the multimodal projector; it only maps CLIP features into the LM embedding space. Instead, pick the head index (or set of indices) you treat as “visual” once and use it for every sample, or report per-head curves so the comparison is head-for-head. A small stability diagnostic is sketched below.
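A diagnostic sketch for that: collect the per-head mean cosine (attn_cos.mean(dim=(1, 2)) from the question’s script) for every sample into a [num_samples, num_heads] tensor, then check whether one head wins consistently. The function name and shapes here are illustrative, not from the original script.

# Sketch: is the "hero head" the same head across samples, or just a per-sample max?
import torch

def hero_head_stability(head_scores_all: torch.Tensor):
    # head_scores_all: [num_samples, num_heads], one row per sample
    argmax_heads = head_scores_all.argmax(dim=1)                  # winning head per sample
    counts = torch.bincount(argmax_heads, minlength=head_scores_all.shape[1])
    top_head = counts.argmax().item()                             # most frequent winner
    stability = counts.max().item() / head_scores_all.shape[0]    # fraction of samples it wins
    return top_head, stability, counts

# If stability is low (the winner changes per sample), max-over-heads is mostly
# selecting noise maxima; prefer a fixed head index or per-head statistics.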


Step 3 — Use per‑head statistics, not global means
Instead of:

attn_cos.mean()

Use:

  • per‑head means
  • per‑head variances
  • per‑head KL divergence between hallucinated vs non‑hallucinated
  • Wasserstein distance between distributions

These preserve the signal.
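For example, a sketch of a per-head comparison between the two groups, assuming you have pooled the cosine values for each head into 1-D arrays (one array per head per group); SciPy’s wasserstein_distance is used for the distributional distance:

# Sketch: compare hallucinated vs. non-hallucinated per head instead of one grand mean.
import numpy as np
from scipy.stats import wasserstein_distance

def per_head_report(halluc_by_head, clean_by_head):
    # halluc_by_head / clean_by_head: lists of 1-D arrays, one array of cosine
    # values per head (all text x image pairs pooled across samples).
    for h, (a, b) in enumerate(zip(halluc_by_head, clean_by_head)):
        print(f"head {h:02d}: "
              f"mean diff={a.mean() - b.mean():+.4f}  "
              f"std diff={a.std() - b.std():+.4f}  "
              f"W1={wasserstein_distance(a, b):.4f}")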


Final Answer (Concise)

Your Q/K slicing is structurally correct.

But your cosine similarity is computed before RoPE, so it does not reflect the attention geometry LLaVA actually uses, and your max-over-heads compares different heads across samples.

This is why the KDE shows a difference while the mean collapses it.

Apply RoPE, use a consistent head selection, and compare per-head / distributional statistics to recover the signal.


If you want, I can write:

  • a corrected version of your entire script
  • a diagnostic that identifies the true “visual heads”
  • a visualization protocol that preserves multimodal structure
  • or a walkthrough of how visual information is routed across heads and layers

Just say the word and I’ll build it.
