The What
I am analyzing the Geometric Alignment (Cosine Similarity) between Text Queries (Q) and Image Keys (K) in the attention layers of LLaVA-1.5-7b. I am comparing two groups of samples: Hallucinated vs. Non-Hallucinated.
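Concretely, for a text-token query vector q_t and an image-token key vector k_i within a single head, the score is cos(q_t, k_i) = (q_t · k_i) / (||q_t|| ||k_i||), computed on the projected Q/K vectors themselves rather than on the softmaxed attention weights (see the script below).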
I am observing a mathematical contradiction between two visualizations of the same underlying data:
- Density Plot (kernel density estimation over all tokens/heads): At Layer 27 I see a clear separation. The Hallucinated group (red) has a left-shifted peak compared to the Non-Hallucinated group (green), indicating lower alignment.
- Layer-Wise Mean Plot: When I plot the mean cosine similarity across all layers (0-31), the two lines (red and green) overlap almost perfectly, showing no difference. (See the aggregation sketch right after this list for how the two views are built.)
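In rough pseudocode, both views come from the same per-sample score tensors and differ only in aggregation (the scores[layer][group] container and function names here are illustrative, not my exact plotting code):

import numpy as np

# scores[layer][group] is a list of per-sample [Heads, Text_Len, Image_Len] cosine tensors.
def density_values(scores, layer, group):
    # Density plot: every head / token-pair score becomes one data point for the KDE.
    return np.concatenate([t.flatten().float().cpu().numpy() for t in scores[layer][group]])

def layer_mean(scores, layer, group):
    # Layer-wise plot: each sample is first collapsed to a single scalar, then averaged.
    return float(np.mean([t.mean().item() for t in scores[layer][group]]))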
The Why (Hypotheses)
I suspect this is due to how I am aggregating the attention heads or tokens.
- Hypothesis A (Aggregation Masking): The "Hallucinated" distribution might have higher variance (heavy tails) but a similar mean. Averaging over [Batch, Heads, Seq_Len] collapses this distributional difference into a single scalar that hides the signal (see the toy check after this list).
- Hypothesis B (Head Indexing): My "Hero Head" logic (taking max(dim=1)) might be flawed if the "Visual Head" index shifts dynamically per sample, effectively capturing noise maxima rather than the signal of one specific visual head (a quick stability check is sketched below as well).
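As a sanity check for Hypothesis A, a purely synthetic example (no model involved) shows how a left-shifted bulk plus a heavy right tail can match the other group's mean exactly while still separating in a density plot:

import torch

torch.manual_seed(0)
# "Clean": tight and symmetric around 0.20.
clean = 0.20 + 0.03 * torch.randn(100_000)
# "Hallucinated": bulk shifted left to ~0.15, plus a heavy right tail that pulls the mean back to ~0.20.
halluc = torch.cat([0.15 + 0.03 * torch.randn(90_000),
                    0.65 + 0.10 * torch.randn(10_000)])

print(clean.mean().item(), halluc.mean().item())      # ~0.20 vs ~0.20 -> layer-wise mean plot overlaps
print(clean.median().item(), halluc.median().item())  # ~0.20 vs ~0.15 -> KDE peaks separate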
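For Hypothesis B, the quick test I have in mind is to log which head wins the max for each sample and check whether that index is stable across the dataset (sketch; head_scores_per_sample would be the list of per-head scores collected from the script below):

import collections
import torch

def hero_head_stability(head_scores_per_sample):
    # head_scores_per_sample: list of [num_heads] tensors, one per sample
    # (the per-head mean cosine values before taking the max).
    winners = [int(torch.argmax(s)) for s in head_scores_per_sample]
    # If one head dominates the counts, the "Hero Head" is a real, fixed head;
    # if the winning index jumps around, max() is mostly picking noise maxima.
    return collections.Counter(winners).most_common()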
The Code
Below is the reproduction script I am using. I am hooking q_proj and k_proj to extract the raw vectors before the attention computation.
Question: Is my method of slicing the Q (text part) and K (image part) and computing the cosine similarity correct for LLaVA's architecture? Specifically, am I missing a rotation (RoPE) or a head-permutation step that would make the "Mean" calculation valid?
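For reference, the rotation I am worried about is the standard LLaMA-style RoPE, which is applied to Q and K after q_proj/k_proj and before the dot product. In sketch form (the cos/sin tables would have to come from the layer's rotary embedding for the actual token positions, which my hooks do not currently capture):

def rotate_half(x):
    # Split the head dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # x: [Heads, Seq, Head_Dim]; cos/sin: position-dependent tables broadcastable to x.
    return x * cos + rotate_half(x) * sin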
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
# ... [Standard Imports & Config] ...
def capture_metrics(model, processor, image, text_prompt, target_layer_idx):
    # Hook storage
    captured = {}
    def hook_q(m, i, o): captured['q'] = o.detach()
    def hook_k(m, i, o): captured['k'] = o.detach()

    layer = model.language_model.model.layers[target_layer_idx].self_attn
    h1 = layer.q_proj.register_forward_hook(hook_q)
    h2 = layer.k_proj.register_forward_hook(hook_k)

    # Inference (inputs moved to the model's device)
    inputs = processor(text=text_prompt, images=image, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        model(**inputs)

    # Cleanup
    h1.remove()
    h2.remove()

    # --- GEOMETRIC ANALYSIS ---
    # LLaVA-1.5-7b: 32 heads, 128 head dim
    num_heads = 32
    head_dim = 128

    # 1. Slicing Text vs Image
    # (Assuming standard LLaVA tokenization where the image occupies 576 positions at the start,
    #  i.e. input_ids already contain the expanded <image> placeholders seen by the hooks)
    input_ids = inputs.input_ids[0].tolist()
    img_start = input_ids.index(32000)  # <image> token
    img_end = img_start + 576
    text_start = img_end

    # [Batch, Seq, Hidden] -> [Heads, Seq, Head_Dim]
    Q_text = captured['q'][0, text_start:].view(-1, num_heads, head_dim).transpose(0, 1)
    K_img = captured['k'][0, img_start:img_end].view(-1, num_heads, head_dim).transpose(0, 1)

    # 2. Cosine Similarity
    Q_norm = F.normalize(Q_text, p=2, dim=-1)  # [Heads, Text_Len, Dim]
    K_norm = F.normalize(K_img, p=2, dim=-1)   # [Heads, Image_Len, Dim]
    # [Heads, Text_Len, Image_Len]
    attn_cos = torch.matmul(Q_norm, K_norm.transpose(1, 2))

    # 3. Aggregation (the potential issue)
    # Method A: mean over everything
    mean_score = attn_cos.mean().item()
    # Method B: max head ("Hero Head") -- mean over tokens, then max over heads
    head_scores = attn_cos.mean(dim=(1, 2))
    max_head_score = head_scores.max().item()

    return mean_score, max_head_score
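For completeness, this is roughly how I drive the function over the two groups (the model ID is assumed to be the llava-hf hub checkpoint; halluc_samples / clean_samples are placeholder lists of my (image, prompt) pairs):

# Rough driver sketch.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

results = {"halluc": [], "clean": []}
for group, samples in [("halluc", halluc_samples), ("clean", clean_samples)]:
    for image, prompt in samples:
        mean_s, hero_s = capture_metrics(model, processor, image, prompt, target_layer_idx=27)
        results[group].append((mean_s, hero_s))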
