I did a little experiment with a smaller LLaVA model.
Why your steering helps CHAIR captioning but not POPE Yes/No
A useful framing is that these two benchmarks stress different failure modes:
- CHAIR (captioning) is dominated by multi-token drift: the model starts grounded, then language priors (co-occurrence, "table → chairs") gradually take over during longer decoding. CHAIR explicitly measures mentioning objects not in the image in open-ended captions. (aclanthology.org)
- POPE (Yes/No) turns hallucination into a binary classification: is the object present? The output is often decided by the first answer token ("Yes" vs "No"), and POPE even tracks a "Yes-ratio" because models can overproduce "Yes." (aclanthology.org)
Your intervention ("restore attention to image tokens when it drops") is naturally aligned with drift (captioning) but only weakly aligned with the binary decision (POPE), especially when the decision is effectively taken at step-0.
A. The "prefill gap": why decode-only steering often can't flip Yes/No
What prefill does (and why it matters)
In standard autoregressive generation, the prefill phase processes the whole prompt to:
- build the KV cache, and
- compute logits for the first generated token. (Hugging Face)
For a Yes/No task, that first generated token is often the decision itself. Once the model has produced a very confident step-0 margin (Yes vs No), a decode-time-only "look harder" signal rarely changes it unless you recompute the step-0 logits under the intervention.
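To make "the decision is taken at step-0" concrete: the first answer token's distribution is fully determined by the prefill forward pass, so you can read the Yes/No margin without generating anything. A minimal sketch, assuming a LLaVA-style model and processor loaded as in the demo script at the end (the hard-coded float16 cast mirrors that script; the full script also handles placeholder/feature alignment):
import torch

@torch.inference_mode()
def step0_yes_no_margin(model, processor, prompt, image):
    # Single prefill forward pass; the last-position logits ARE the step-0 logits.
    enc = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    enc["pixel_values"] = enc["pixel_values"].to(torch.float16)  # match the demo's compute dtype
    logits = model(**enc).logits[0, -1]
    tok = processor.tokenizer
    yes_id = tok("Yes", add_special_tokens=False).input_ids[0]
    no_id = tok("No", add_special_tokens=False).input_ids[0]
    return float(logits[yes_id] - logits[no_id])  # > 0 means "Yes" is favored before any decoding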
Evidence that this is the right lever
Many "training-free" hallucination fixes that improve POPE-style behavior operate by altering the output distribution early, not only by keeping attention on image tokens late:
- VCD (Visual Contrastive Decoding) explicitly contrasts logits from the real image vs a distorted image to reduce unimodal/language priors and statistical bias. This is a logit-space correction that directly changes token selection. (CVF Open Access)
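For concreteness, the core of that correction is just a logit-space combination; here is a minimal sketch of the VCD-style contrast (alpha and the choice of distortion are knobs, and the paper also adds an adaptive plausibility constraint that this sketch omits):
import torch

def vcd_style_logits(logits_real: torch.Tensor,
                     logits_distorted: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    # Amplify what the real image contributes beyond the prior captured by the
    # distorted-image pass: (1 + alpha) * l_real - alpha * l_distorted.
    return (1.0 + alpha) * logits_real - alpha * logits_distorted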
Practical implication for your method
If your steering is only active during decoding (after prefill), you should expect:
- Captioning gains (because you correct drift over many tokens)
- Limited Yes/No gains whenever the step-0 decision is already confident
Debug check: always compute the step-0 Yes/No logit margin (as you did). If the margin is large before decoding, decode-only steering is unlikely to flip it.
B. The "signal of absence": why "high attention somewhere" doesn't mean "object absent"
Attention ≠ evidence
Even if you successfully push attention mass onto image tokens, that does not guarantee the model has formed an internal representation like "dog absent." Attention is a routing mechanism; its weights are not reliably a causal explanation of the decision. (arXiv)
So it is possible (and common) to observe:
- high attention to image tokens
- unchanged Yes/No decision
Absence/negation is genuinely hard for VLMs
A big part of "object absence" is a form of negation reasoning ("there is no dog"). Multiple recent results show modern vision-language models struggle with negation and "not/without"-style semantics. (CVF Open Access)
That means "force looking" can easily lead to:
- attention moving to background or other objects (cats, blanket),
- but the model still defaulting to a biased "Yes," especially on adversarial/negative POPE queries.
What tends to work better for absence queries
Methods that create a counterfactual or contrastive baseline provide a more meaningful "absence signal" than "attention anywhere":
- Compare logits for the real image vs a null / heavily corrupted image (or "blur/gray/zero") and use the difference as "visual evidence strength."
- VCD is a canonical example of this idea in decoding. (CVF Open Access)
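A minimal way to turn that into an absence-aware decision rule (tau is an arbitrary threshold I'm assuming, not something from either paper; m_real/m_null are the step-0 margins defined in the debugging section below):
def absence_aware_answer(m_real: float, m_null: float, tau: float = 0.5) -> str:
    # evidence = how much the real image moves the Yes/No margin vs. a null image;
    # near zero means the answer is coming from the language prior, not the pixels.
    evidence = m_real - m_null
    return "Yes" if (m_real > 0 and evidence > tau) else "No"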
C. Layer specificity: "grounding heads" vs "answering heads" is not cleanly separable
LLaVA mixing makes late attention a weak handle on decisions
LLaVA-style models insert projected vision tokens into the LLM sequence and then use standard transformer blocks to mix modalities throughout the residual stream. (arXiv)
So even if some heads appear "visual," it doesn't guarantee:
- they causally control the Yes/No logits,
- or that boosting them late will move the decision.
Token pruning results support "vision influence saturates early/mid"
Work on LLaVA-1.5 vision token pruning shows you can drop a large fraction of vision tokens by mid layers with minimal accuracy loss. That strongly suggests that by mid-depth the model has already extracted what it will use, and later layers are often dominated by language-side consolidation. (arXiv)
Implication: steering layers 12–28 may stabilize captioning trajectories (good) but still fail to change the early representation that controls a step-0 Yes/No decision (bad for POPE).
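If you want to check this on your own checkpoint, a cheap per-layer probe is to measure how much of the decision position's attention lands on image tokens. A sketch, assuming the model is loaded with attn_implementation="eager" (SDPA/flash kernels return no attention weights) and that enc and image_token_id come from helpers like the ones in the demo script:
import torch

@torch.inference_mode()
def attention_mass_on_image_tokens(model, enc, image_token_id):
    # Fraction of the last prompt position's attention that lands on image-token
    # positions, per layer, averaged over heads.
    out = model(**enc, output_attentions=True, use_cache=False)
    img_pos = (enc["input_ids"][0] == image_token_id)   # bool mask over sequence positions
    per_layer = []
    for attn in out.attentions:                          # each: (batch, heads, seq, seq)
        last_row = attn[0, :, -1, :]                     # attention from the last (decision) position
        per_layer.append(float(last_row[:, img_pos].sum(dim=-1).mean()))
    return per_layer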
How to debug the "Grounding vs Faithfulness" gap efficiently
1) Convert POPE into a pure step-0 classification probe
- Force max_new_tokens=1.
- Score a Yes token set vs a No token set at step-0 (logsumexp over first-token variants like "Yes"/" yes" and "No"/" no").
- Track POPE's Yes-ratio as a bias indicator. (aclanthology.org)
This removes generation noise and makes "did the decision move?" unambiguous.
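If you prefer to stay inside generate(), the same probe can be done with max_new_tokens=1 plus returned scores; a sketch, reusing the yes_ids/no_ids first-token lists built in the demo script:
import torch

@torch.inference_mode()
def step0_margin_via_generate(model, enc, yes_ids, no_ids) -> float:
    gen = model.generate(**enc, max_new_tokens=1, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)
    scores = gen.scores[0][0].float()  # (vocab,) scores for the single generated token
    return float(torch.logsumexp(scores[yes_ids], 0) - torch.logsumexp(scores[no_ids], 0))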
2) Use a stronger "vision removed" baseline than light Gaussian noise
Your earlier run showed small sensitivity between the real and mildly distorted input (common when the distortion is too weak). For diagnosis, use one of:
- pixel_values = 0 (hard null)
- very large noise (sigma 1–2 in normalized space)
- heavy blur + downsample/upscale
The goal is not realism; it's to approximate "no usable vision signal."
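A sketch of those three baselines, operating directly on the processor's normalized pixel_values (the mode names, sigma, and clamp range are arbitrary choices):
import torch
import torch.nn.functional as F

def make_null_pixel_values(pixel_values: torch.Tensor, mode: str = "zero",
                           sigma: float = 1.5, seed: int = 0) -> torch.Tensor:
    # pixel_values: (B, 3, H, W), already normalized by the processor.
    if mode == "zero":
        return torch.zeros_like(pixel_values)              # hard null
    if mode == "noise":
        torch.manual_seed(seed)
        noise = torch.randn_like(pixel_values) * sigma
        return (pixel_values + noise).clamp(-3.0, 3.0)     # heavy noise in normalized space
    if mode == "blur":
        small = F.interpolate(pixel_values, scale_factor=0.125, mode="bilinear")
        return F.interpolate(small, size=pixel_values.shape[-2:], mode="bilinear")
    raise ValueError(f"unknown mode: {mode}")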
3) Bucket failures into two categories (this saves time)
For each example compute:
- m_text (text-only prior margin)
- m_real (real image margin)
- m_null (null/corrupted image margin)
Then classify:
- Prior-dominated: m_real ≈ m_text and m_real ≈ m_null → vision isn't moving step-0; decode-time attention fixes won't help much.
- Vision-sensitive but wrong: m_real differs from m_null, but the answer is still wrong → likely a perception/recognition failure or dataset ambiguity; attention boosting won't fix missing features.
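The bucketing itself can be a few lines (eps is an arbitrary "margins are close" threshold; tune it by inspecting a handful of examples):
def bucket_failure(m_text: float, m_real: float, m_null: float,
                   label_is_yes: bool, eps: float = 0.5) -> str:
    predicted_yes = m_real > 0
    if predicted_yes == label_is_yes:
        return "correct"
    if abs(m_real - m_text) < eps and abs(m_real - m_null) < eps:
        return "prior-dominated"           # vision isn't moving step-0
    return "vision-sensitive-but-wrong"    # perception failure or ambiguous example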
4) Test whether your method can affect step-0 at all
Do a controlled experiment:
- run prefill to just before the first answer token,
- apply your steering (KV/attention modification),
- recompute the step-0 logits.
If step-0 doesn't move, your mechanism is operating "too late" for POPE by construction.
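A skeleton for that controlled experiment; steering_hooks is a deliberate placeholder for whatever hooks/KV edits implement your steering, and margin_fn is a step-0 margin function like yesno_margin_step0 in the demo script:
import contextlib
import torch

@contextlib.contextmanager
def steering_hooks(model):
    # Placeholder: register the forward hooks / KV edits that implement your
    # steering here, and remove them on exit. Intentionally abstract.
    handles = []
    try:
        yield
    finally:
        for h in handles:
            h.remove()

@torch.inference_mode()
def step0_margin_with_and_without_steering(model, enc, yes_ids, no_ids, margin_fn):
    base = margin_fn(model(**enc, use_cache=False).logits[0, -1].float().cpu(), yes_ids, no_ids)
    with steering_hooks(model):
        steered = margin_fn(model(**enc, use_cache=False).logits[0, -1].float().cpu(), yes_ids, no_ids)
    return base, steered  # if these barely differ, the steering never touches step-0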
5) Consider a hybrid: keep your captioning steering, add a Yes/No logit correction
A common pattern is:
- Captioning: keep your decode-time grounding (drift control)
- Yes/No: apply a step-0 correction (contrastive/logit calibration)
VCD is a good reference design for the Yes/No side because it directly targets unimodal priors by contrasting real vs distorted distributions. (CVF Open Access)
6) Don't overinterpret "attention to background"
Given the literature on attention not being a faithful explanation, treat "attention moved to background" as a symptom (the model is forced to route attention somewhere) rather than a decision signal ("therefore it believes presence/absence"). (arXiv)
Where your current observations fit relative to known methods
Your captioning success resembles mechanisms described in decoding-focused hallucination work:
- OPERA reports hallucinations correlate with attention patterns that over-trust a few summary tokens during decoding and proposes decoding-time penalties/rollback. This kind of mechanism is naturally aligned with long-form drift. (CVF Open Access)
Your POPE failure aligns with what POPE and VCD emphasize:
- POPE frames hallucination as binary probing and highlights bias and co-occurrence effects in LVLMs. (arXiv)
- VCD targets unimodal priors/statistical bias via contrastive distributions, precisely the regime where "looking more" is insufficient. (CVF Open Access)
Concrete next step
Run 50–200 POPE negatives and log three margins per example (m_text, m_real, m_null). Then answer two questions:
- How many failures are prior-dominated? (vision not affecting step-0)
- When you apply your steering during prefill, does m_real move?
If (1) is large and (2) is small, the right fix is not "more visual attention," but step-0 logit calibration (VCD-like) or prefill-time intervention (recomputing step-0 under the intervention).
# Single-file demo (no argparse): LLaVA step-0 Yes/No diagnostics + caption sanity
#
# Deps:
# pip -q install -U "transformers>=4.40" "accelerate>=0.25" "bitsandbytes>=0.43" pillow requests
#
# Notes:
# - 4-bit NF4 loading via BitsAndBytesConfig.
# - For Intel/llava-gemma-2b, prompt formatting is via tokenizer.apply_chat_template.
# - LLaVA docs warn some checkpoints require setting processor.patch_size / num_additional_image_tokens /
#   vision_feature_select_strategy.
# - If you see a token/feature mismatch, it can also be caused by truncation of expanded image placeholders;
#   this demo disables truncation and bumps model_max_length.
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
# -------------------------
# Config (edit here)
# -------------------------
MODEL_ID = "Intel/llava-gemma-2b" # primary demo model (you already downloaded it)
# If you want a non-Gemma fallback that is typically stable in HF LLaVA format:
# MODEL_ID = "xtuner/llava-phi-3-mini-hf"
IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"
QUESTION_YN = "Is there a dog in this image? Answer Yes or No."
QUESTION_CAPTION = "Describe this image."
DISTORT_K = 3
DISTORT_SIGMA = 0.20
DISTORT_SEED = 0
GEN_MAX_NEW_TOKENS = 16
GEN_DO_SAMPLE = False
DTYPE = torch.float16
# Memory safety (T4): leave headroom for activations
MAX_MEMORY = {0: "13GiB", "cpu": "10GiB"}
# -------------------------
# Utilities
# -------------------------
def print_cuda_info():
if torch.cuda.is_available():
p = torch.cuda.get_device_properties(0)
print(f"GPU: {p.name} | VRAM: {p.total_memory/(1024**3):.1f} GB")
print(f"torch: {torch.__version__} | device: {torch.cuda.current_device()}")
else:
print("CUDA not available.")
def cuda_mem(tag=""):
if not torch.cuda.is_available():
return
a = torch.cuda.memory_allocated() / (1024**3)
r = torch.cuda.memory_reserved() / (1024**3)
print(f"[mem]{' '+tag if tag else ''} allocated={a:.2f}GB reserved={r:.2f}GB")
def load_image(url: str) -> Image.Image:
r = requests.get(url, stream=True, timeout=30)
r.raise_for_status()
return Image.open(r.raw).convert("RGB")
def make_prompt(tokenizer, question: str, with_image: bool) -> str:
    # Intel/llava-gemma-2b model card uses this chat-template pattern.
if hasattr(tokenizer, "apply_chat_template") and getattr(tokenizer, "chat_template", None):
content = f"<image>\n{question}" if with_image else question
msgs = [{"role": "user", "content": content}]
return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    # Fallback (debug-only): plain USER/ASSISTANT prompt
    img_prefix = "<image>\n" if with_image else ""
    return f"USER: {img_prefix}{question}\nASSISTANT:"
def first_token_ids(tokenizer, strings):
ids = []
for s in strings:
toks = tokenizer(s, add_special_tokens=False).input_ids
if toks:
ids.append(toks[0])
return sorted(set(ids))
def logsumexp_ids(logits_1d, ids):
x = logits_1d[ids]
return torch.logsumexp(x, dim=0)
def yesno_margin_step0(logits_1d, yes_ids, no_ids) -> float:
y = logsumexp_ids(logits_1d, yes_ids)
n = logsumexp_ids(logits_1d, no_ids)
return float((y - n).item())
def get_image_token_id(model, tokenizer):
itok = getattr(model.config, "image_token_id", None) or getattr(model.config, "image_token_index", None)
if itok is not None:
return int(itok)
    # last-resort fallback: look up the literal "<image>" token
    tid = tokenizer.convert_tokens_to_ids("<image>")
    return int(tid) if tid is not None and tid != tokenizer.unk_token_id else None
def fix_processor_fields(processor, model):
"""
    Fill in missing processor fields (patch_size=None etc.). This is recommended by the LLaVA docs and
    is a known issue for LLaVA-Gemma processors.
"""
# patch_size
ps = getattr(processor, "patch_size", None)
if ps is None:
ps = getattr(getattr(model.config, "vision_config", None), "patch_size", None)
if ps is None:
ps = 14
processor.patch_size = int(ps)
# feature select strategy
vsel = getattr(processor, "vision_feature_select_strategy", None)
if vsel is None:
vsel = getattr(model.config, "vision_feature_select_strategy", None)
if vsel is None:
vsel = "default"
processor.vision_feature_select_strategy = str(vsel)
# additional image tokens
    # IMPORTANT: many VLM encoders include a CLS token; if this is wrong, you get a +1 mismatch.
nai = getattr(processor, "num_additional_image_tokens", None)
if nai is None:
nai = 1
processor.num_additional_image_tokens = int(nai)
print(
f"[processor fix] patch_size={processor.patch_size}, "
f"num_additional_image_tokens={processor.num_additional_image_tokens}, "
f"vision_feature_select_strategy={processor.vision_feature_select_strategy}"
)
@torch.inference_mode()
def vision_token_count(model, pixel_values: torch.Tensor) -> int:
"""
Count vision tokens produced by the vision tower for these pixel_values.
We use get_image_features (used internally by LLaVA models). :contentReference[oaicite:10]{index=10}
"""
feats = model.get_image_features(pixel_values=pixel_values, return_dict=True).pooler_output
# pooler_output can be list[Tensor] or Tensor; shapes can be (B,N,D) or (N,D) depending on model wiring.
if isinstance(feats, (list, tuple)):
f = feats[0]
else:
f = feats
if f.ndim == 3:
return int(f.shape[1])
if f.ndim == 2:
return int(f.shape[0])
raise RuntimeError(f"Unexpected feature shape: {tuple(f.shape)}")
@torch.inference_mode()
def encode_mm_with_autoalign(model, processor, prompt: str, image: Image.Image, max_tries: int = 3):
"""
Encode multimodal inputs and auto-fix the common off-by-one mismatch:
placeholders (image tokens) != vision tokens
    This avoids the forward() crash "Image features and image tokens do not match".
"""
tok = processor.tokenizer
itok = get_image_token_id(model, tok)
    # Avoid truncation of expanded image placeholder tokens.
if getattr(tok, "model_max_length", 0) and tok.model_max_length < 8192:
tok.model_max_length = 8192
for attempt in range(1, max_tries + 1):
enc = processor(
text=prompt,
images=image,
return_tensors="pt",
truncation=False,
padding=False,
)
enc = {k: v.to(model.device) for k, v in enc.items()}
if "pixel_values" in enc:
enc["pixel_values"] = enc["pixel_values"].to(DTYPE)
if itok is None:
return enc # cannot count placeholders; proceed
n_place = int((enc["input_ids"] == itok).sum().item())
n_feat = vision_token_count(model, enc["pixel_values"])
if n_place == n_feat:
if attempt > 1:
print(f"[autoalign] fixed after {attempt-1} adjustment(s): placeholders={n_place}, feat_tokens={n_feat}")
return enc
delta = n_feat - n_place
print(f"[autoalign] mismatch attempt {attempt}: placeholders={n_place}, feat_tokens={n_feat}, delta={delta}")
# adjust additional image tokens to close the gap (often delta==+1)
processor.num_additional_image_tokens = int(getattr(processor, "num_additional_image_tokens", 0) + delta)
raise ValueError("Could not auto-align image tokens/features after retries; check prompt (<image>) and processor settings.")
@torch.inference_mode()
def step0_mm(model, processor, prompt, image):
enc = encode_mm_with_autoalign(model, processor, prompt, image)
out = model(**enc, use_cache=False, return_dict=True)
return out.logits[0, -1].float().cpu(), enc
@torch.inference_mode()
def step0_text(model, processor, prompt):
enc = processor(text=prompt, return_tensors="pt", truncation=False, padding=False)
enc = {k: v.to(model.device) for k, v in enc.items()}
out = model(**enc, use_cache=False, return_dict=True)
return out.logits[0, -1].float().cpu()
@torch.inference_mode()
def step0_distorted_avg(model, enc_mm, k=3, sigma=0.2, seed=0):
torch.manual_seed(seed)
input_ids = enc_mm["input_ids"]
attention_mask = enc_mm.get("attention_mask", None)
pixel_values = enc_mm["pixel_values"]
acc = None
for _ in range(k):
noise = torch.randn_like(pixel_values) * sigma
pv = (pixel_values + noise).clamp(-3.0, 3.0)
kwargs = {"input_ids": input_ids, "pixel_values": pv}
if attention_mask is not None:
kwargs["attention_mask"] = attention_mask
out = model(**kwargs, use_cache=False, return_dict=True)
logits = out.logits[0, -1].float()
acc = logits if acc is None else (acc + logits)
return (acc / k).cpu()
@torch.inference_mode()
def generate_mm(model, processor, prompt, image, max_new_tokens=16, do_sample=False):
enc = encode_mm_with_autoalign(model, processor, prompt, image)
gen = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=do_sample, use_cache=True)
return processor.batch_decode(gen, skip_special_tokens=True)[0]
# -------------------------
# Main
# -------------------------
def main():
torch.backends.cuda.matmul.allow_tf32 = True
print_cuda_info()
qconf = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=DTYPE,
bnb_4bit_use_double_quant=True,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tok = processor.tokenizer
if tok is not None:
tok.padding_side = "left"
model = LlavaForConditionalGeneration.from_pretrained(
MODEL_ID,
device_map="auto",
max_memory=MAX_MEMORY,
torch_dtype="auto",
quantization_config=qconf,
)
model.eval()
cuda_mem("after load")
fix_processor_fields(processor, model)
image = load_image(IMAGE_URL)
print(f"Image loaded: {IMAGE_URL} size={image.size}")
# Step-0 Yes/No token IDs (first-token only)
yes_ids = first_token_ids(tok, ["Yes", " yes"])
no_ids = first_token_ids(tok, ["No", " no"])
print(f"Yes first-token ids: {yes_ids}")
print(f"No first-token ids: {no_ids}")
prompt_yn_mm = make_prompt(tok, QUESTION_YN, with_image=True)
prompt_yn_txt = make_prompt(tok, QUESTION_YN, with_image=False)
prompt_cap_mm = make_prompt(tok, QUESTION_CAPTION, with_image=True)
print("\n=== YES/NO: step-0 (prefill) diagnostics ===")
logits_real, enc_mm = step0_mm(model, processor, prompt_yn_mm, image)
logits_text = step0_text(model, processor, prompt_yn_txt)
logits_dist = step0_distorted_avg(model, enc_mm, k=DISTORT_K, sigma=DISTORT_SIGMA, seed=DISTORT_SEED)
m_real = yesno_margin_step0(logits_real, yes_ids, no_ids)
m_text = yesno_margin_step0(logits_text, yes_ids, no_ids)
m_dist = yesno_margin_step0(logits_dist, yes_ids, no_ids)
sens = abs(m_real - m_dist)
print(f"m_real (MM) = {m_real:+.3f}")
print(f"m_text (text-only) = {m_text:+.3f}")
print(f"m_dist (distorted) = {m_dist:+.3f} (K={DISTORT_K}, sigma={DISTORT_SIGMA})")
print(f"sensitivity |m_real - m_dist| = {sens:.3f}")
print("\n[gen] Yes/No output (short):")
print(generate_mm(model, processor, prompt_yn_mm, image, max_new_tokens=GEN_MAX_NEW_TOKENS, do_sample=GEN_DO_SAMPLE))
cuda_mem("after yes/no")
print("\n=== CAPTION: short generation ===")
print(generate_mm(model, processor, prompt_cap_mm, image, max_new_tokens=GEN_MAX_NEW_TOKENS, do_sample=GEN_DO_SAMPLE))
cuda_mem("done")
if __name__ == "__main__":
main()