LLaVA Steering: Why does grounding fix hallucinations in captioning but not in Yes/No QA?

Hi everyone,

I am working on an inference-time steering method for LLaVA-1.5-7b to improve visual grounding. My method works by monitoring the attention layers during generation. If the model’s attention to the image features drops below a certain threshold (i.e., if it starts ignoring the image tokens), my mechanism intervenes to boost the attention scores back onto the visual tokens.

I have verified that the intervention is active and working mechanically. However, I am observing a stark contrast in how this affects downstream performance on two standard hallucination benchmarks:

1. Generative Captioning (CHAIR Benchmark): Success
In free-form captioning tasks (“Describe this image…”), the method works exactly as intended. It prevents the model from “drifting” away from the image. For example, if the model is about to hallucinate an object based on text probability (e.g., seeing a “table” implies “chairs”), the steering forces it to look back at the image pixels, effectively correcting the hallucination.

2. Binary QA (POPE Benchmark): Failure
In Yes/No probing tasks (“Is there a dog?”), the exact same mechanism fails to correct the model’s bias.

  • The Scenario: I feed the model an image without a dog and ask “Is there a dog?”.

  • The Behavior: The model initially prepares to answer “Yes” based on its internal instruction-following bias.

  • The Intervention: My method detects low attention to the image and forces the model to attend strongly to the image features.

  • The Result: Presumably because the image contains no dog, the boosted attention lands on the background. The model still answers “Yes.”


My Hypothesis & Questions for the Community:

I suspect this is due to a fundamental difference in how VLMs handle “Grounding” (looking) vs “Answering” (deciding). I would appreciate any insights on the following:

A. The “Prefill Gap” (Timing)
In a short QA task like “Is there a dog?”, does LLaVA form its decision logic entirely during the prefill (prompt processing) stage?

  • My method currently only steers the decoding steps (token generation).

  • Question: Has anyone successfully changed a VLM’s answer by steering only the generation phase, or is the answer “baked in” to the KV cache of the prompt? Do I need to intervene on the prompt tokens themselves?

B. The “Signal of Absence”
When I force the model to attend to the image in the absence of the object, the attention mass shifts to the background scenery.

  • Question: Does the VLM interpret “high attention anywhere in the image” as evidence of presence? If so, simply “looking at the image” is statistically indistinguishable from “finding the object” for the attention head. How do standard steering methods handle “absence” queries?

C. Layer Specificity
I am currently steering distinct visual attention heads in the middle/late layers (12-28).

  • Question: In LLaVA, are the “Grounding” heads (responsible for finding objects) distinct from the “Answering” heads (responsible for outputting Yes/No)? Is it possible to steer visual processing without propagating that signal to the decision-making layers?

Any advice on debugging this “Grounding vs. Faithfulness” gap would be appreciated!


I did a little experiment with a smaller LLaVA model.


Why your steering helps CHAIR captioning but not POPE Yes/No

A useful framing is that these two benchmarks stress different failure modes:

  • CHAIR (captioning) is dominated by multi-token drift: the model starts grounded, then language priors (co-occurrence, “table→chairs”) gradually take over during longer decoding. CHAIR explicitly measures mentioning objects not in the image in open-ended captions. (aclanthology.org)
  • POPE (Yes/No) turns hallucination into a binary classification: object present? The output is often decided by the first answer token (“Yes” vs “No”), and POPE even tracks “Yes-ratio” bias because models can overproduce “Yes.” (aclanthology.org)

Your intervention (“restore attention to image tokens when it drops”) is naturally aligned with drift (captioning) but only weakly aligned with a binary decision (POPE), especially when that decision is effectively made at step-0.


A. The “Prefill gap”: why decode-only steering often can’t flip Yes/No

What prefill does (and why it matters)

In standard autoregressive generation, the prefill phase processes the whole prompt to:

  1. build the KV cache, and
  2. compute logits for the first generated token. (Hugging Face)

For a Yes/No task, that first generated token is often the decision itself. Once the model has produced a very confident step-0 margin (Yes vs No), decode-time-only “look harder” rarely changes it unless you recompute the step-0 logits under the intervention.
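
You can see this directly. A minimal sketch (assuming `model`, `processor`, and a PIL `image` are already loaded, e.g. as in the demo at the end of this post; the prompt shown is the plain LLaVA-1.5 template):

import torch

# The step-0 distribution is just the last-position logits of the prefill forward pass.
prompt = "USER: <image>\nIs there a dog in this image? Answer Yes or No.\nASSISTANT:"
enc = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model(**enc, use_cache=False)

step0_logits = out.logits[0, -1]  # "Yes" vs "No" is decided from this tensor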

Evidence that this is the right lever

Many “training-free hallucination fixes” that improve POPE-style behavior operate by altering the output distribution early, not only by keeping attention on image tokens late:

  • VCD (Visual Contrastive Decoding) explicitly contrasts logits from the real image vs a distorted image to reduce unimodal/language priors and statistical bias. This is a logit-space correction that directly changes token selection. (CVF Open Access)

Practical implication for your method

If your steering is only active during decoding (after prefill), you should expect:

  • Captioning gains (because you correct drift over many tokens)
  • Limited Yes/No gains whenever the step-0 decision is already confident

Debug check: always compute the step-0 Yes/No logit margin (as you did). If the margin is large before decoding, decode-only steering is unlikely to flip it.


B. The “signal of absence”: why “high attention somewhere” doesn’t mean “object absent”

Attention ≠ evidence

Even if you successfully push attention mass onto image tokens, that does not guarantee the model has formed an internal representation like “dog absent.” Attention is a routing mechanism; its weights are not reliably a causal explanation of the decision. (arXiv)

So it is possible (and common) to observe both of the following at once (a small measurement sketch follows this list):

  • high attention to image tokens
  • unchanged Yes/No decision
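
A minimal way to quantify the first bullet (a sketch; it assumes `enc` is the processor output for an image+question prompt with expanded `<image>` placeholders, and that the model was loaded with `attn_implementation="eager"` so attention weights are returned):

import torch

image_token_id = getattr(model.config, "image_token_index", None) or model.config.image_token_id
img_mask = (enc["input_ids"][0] == image_token_id)

with torch.inference_mode():
    out = model(**enc, use_cache=False, output_attentions=True)

layer = len(out.attentions) // 2                       # pick a mid layer
att_last = out.attentions[layer][0, :, -1, :]          # attention from the answer position, per head
image_attention_mass = att_last[:, img_mask].sum(-1)   # per-head mass on image tokens

# High mass here with an unchanged Yes/No margin is exactly the
# "looking without deciding differently" symptom described above.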

Absence/negation is genuinely hard for VLMs

A big part of “object absence” is a form of negation reasoning (“there is no dog”). Multiple recent results show modern vision-language models struggle with negation and “not/without”-style semantics. (CVF Open Access)

That means “force looking” can easily lead to:

  • attention moving to background or other objects (cats, blanket),
  • but the model still defaulting to a biased “Yes,” especially on adversarial/negative POPE queries.

What tends to work better for absence queries

Methods that create a counterfactual or contrastive baseline provide a more meaningful “absence signal” than “attention anywhere”:

  • Compare logits for the real image vs a null / heavily corrupted image (or “blur/gray/zero”) and use the difference as “visual evidence strength” (sketched below).
  • VCD is a canonical example of this idea in decoding. (CVF Open Access)
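
A sketch of that first bullet (it assumes step-0 logits for the real image and for a null/corrupted image, plus the `yesno_margin_step0` helper and Yes/No id sets from the demo at the end of this post):

# "Visual evidence strength" = how much the step-0 Yes/No margin moves when the image is removed.
m_real = yesno_margin_step0(logits_real, yes_ids, no_ids)
m_null = yesno_margin_step0(logits_null, yes_ids, no_ids)

evidence = m_real - m_null   # ~0 means the answer is driven by the language prior, not the image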

C. Layer specificity: “grounding heads” vs “answering heads” is not cleanly separable

LLaVA mixing makes late attention a weak handle on decisions

LLaVA-style models insert projected vision tokens into the LLM sequence and then use standard transformer blocks to mix modalities throughout the residual stream. (arXiv)

So even if some heads appear “visual,” it doesn’t guarantee:

  • they causally control the Yes/No logits,
  • or that boosting them late will move the decision.

Token pruning results support “vision influence saturates early/mid”

Work on LLaVA-1.5 vision token pruning shows you can drop a large fraction of vision tokens by mid layers with minimal accuracy loss. That strongly suggests that by mid-depth the model has already extracted what it will use, and later layers are often dominated by language-side consolidation. (arXiv)

Implication: steering layers 12–28 may stabilize captioning trajectories (good) but still fail to change the early representation that controls a step-0 Yes/No decision (bad for POPE).


How to debug the “Grounding vs Faithfulness” gap efficiently

1) Convert POPE into a pure step-0 classification probe

  • Force max_new_tokens=1
  • Score a Yes token set vs No token set at step-0 (logsumexp over variants like “Yes/ yes”, “No/ no”).
  • Track POPE’s Yes-ratio as a bias indicator. (aclanthology.org)

This removes generation noise and makes “did the decision move?” unambiguous.
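
A compact version of that scoring (a sketch; the single-file demo at the end of this post wires the same idea into LLaVA end to end):

import torch

def yes_no_margin(step0_logits, tokenizer):
    # Aggregate first-token ids of a few surface variants, then compare log-masses.
    def first_ids(strings):
        return sorted({tokenizer(s, add_special_tokens=False).input_ids[0] for s in strings})
    yes_ids = first_ids(["Yes", " Yes", "yes", " yes"])
    no_ids = first_ids(["No", " No", "no", " no"])
    m_yes = torch.logsumexp(step0_logits[yes_ids], dim=0)
    m_no = torch.logsumexp(step0_logits[no_ids], dim=0)
    return float(m_yes - m_no)   # > 0 means the model answers "Yes" at step-0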

2) Use a stronger “vision removed” baseline than light Gaussian noise

Your earlier run showed small sensitivity between real and mildly distorted input (common if distortion is too weak). For diagnosis, use one of:

  • pixel_values = 0 (hard null)
  • very large noise (sigma 1–2 in normalized space)
  • heavy blur + downsample/upscale

The goal is not realism; it’s to approximate “no usable vision signal.”
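
Sketches of the three baselines (assuming `pixel_values` is the processor output for the real image; the blur variant re-encodes a degraded PIL `image` through the image processor):

import torch
from PIL import ImageFilter

# Hard null: no usable visual signal at all.
pv_null = torch.zeros_like(pixel_values)

# Very large noise in normalized space (sigma ~2 swamps the image statistics).
pv_noise = (pixel_values + 2.0 * torch.randn_like(pixel_values)).clamp(-3.0, 3.0)

# Heavy blur + downsample/upsample, then re-encode.
degraded = image.resize((image.width // 8, image.height // 8)).resize(image.size)
degraded = degraded.filter(ImageFilter.GaussianBlur(radius=8))
pv_blur = processor.image_processor(images=degraded, return_tensors="pt")["pixel_values"]
pv_blur = pv_blur.to(pixel_values.device, pixel_values.dtype)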

3) Bucket failures into two categories (this saves time)

For each example compute:

  • m_text (text-only prior margin)
  • m_real (real image margin)
  • m_null (null/corrupted image margin)

Then classify (a small sketch of this bucketing follows the list):

  • Prior-dominated: m_real ≈ m_text and m_real ≈ m_null
    → vision isn’t moving step-0; decode-time attention fixes won’t help much.
  • Vision-sensitive but wrong: m_real differs from m_null, but the answer is still wrong
    → likely perception/recognition failure or dataset ambiguity; attention boosting won’t fix missing features.
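
A sketch of that bucketing (the tolerance `tol` is an assumption to tune on your own margin distributions, and it only makes sense to call this on examples the model got wrong):

def bucket_failure(m_text, m_real, m_null, tol=0.5):
    # tol: assumed tolerance on the step-0 log-odds margin.
    vision_shift = abs(m_real - m_null)   # does removing the image change the decision?
    prior_shift = abs(m_real - m_text)    # does adding the image change the text-only prior?
    if vision_shift < tol and prior_shift < tol:
        return "prior-dominated"             # vision isn't moving step-0
    return "vision-sensitive-but-wrong"      # vision moves step-0, answer is still wrong

# e.g. bucket_failure(m_text=2.1, m_real=2.3, m_null=2.2) -> "prior-dominated"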

4) Test whether your method can affect step-0 at all

Do a controlled experiment:

  • run prefill to just before the first answer token,
  • apply your steering (KV/attention modification),
  • recompute the step-0 logits.

If step-0 doesn’t move, your mechanism is operating “too late” for POPE by construction.
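
A sketch of that controlled experiment, reusing `enc_mm`, `yes_ids`, `no_ids`, `m_real` and the helpers from the single-file demo at the end of this post. The hook below is NOT your attention-score steering; it is a crude stand-in (scaling the residual stream at image-token positions before one mid decoder layer) whose only purpose is to check whether any prefill-time intervention moves the step-0 margin. The module path is version-dependent, so treat it as an assumption:

import torch
from contextlib import contextmanager

@contextmanager
def boost_image_positions(model, image_positions, layer_idx, scale=1.5):
    # Crude prefill-time proxy: amplify image-token activations entering one decoder layer.
    lm = model.language_model
    layers = lm.model.layers if hasattr(lm, "model") else lm.layers  # path differs across versions
    def pre_hook(module, args, kwargs):
        hidden = args[0] if args else kwargs["hidden_states"]
        hidden = hidden.clone()
        hidden[:, image_positions, :] *= scale
        if args:
            return (hidden,) + args[1:], kwargs
        kwargs["hidden_states"] = hidden
        return args, kwargs
    handle = layers[layer_idx].register_forward_pre_hook(pre_hook, with_kwargs=True)
    try:
        yield
    finally:
        handle.remove()

# Recompute the step-0 margin under the intervention and compare.
img_pos = (enc_mm["input_ids"][0] == get_image_token_id(model, tok)).nonzero(as_tuple=True)[0]
with torch.inference_mode(), boost_image_positions(model, img_pos, layer_idx=8):
    steered_logits = model(**enc_mm, use_cache=False).logits[0, -1].float().cpu()
print(f"m_real={m_real:+.3f} -> m_steered={yesno_margin_step0(steered_logits, yes_ids, no_ids):+.3f}")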

5) Consider a hybrid: keep your captioning steering, add a Yes/No logit correction

A common pattern is:

  • Captioning: keep your decode-time grounding (drift control)
  • Yes/No: apply a step-0 correction (contrastive/logit calibration)

VCD is a good reference design for the Yes/No side because it directly targets unimodal priors by contrasting real vs distorted distributions. (CVF Open Access)
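
A sketch of the Yes/No side, using the core contrastive form from VCD on the step-0 logits only (`logits_real` / `logits_dist` are the real and distorted step-0 logits from the demo below; a hard-null baseline from the earlier sketch works too, and `alpha` is an assumed contrast strength):

alpha = 1.0  # assumed value; VCD treats this as a tunable hyperparameter
logits_cd = (1.0 + alpha) * logits_real - alpha * logits_dist

m_cd = yesno_margin_step0(logits_cd, yes_ids, no_ids)
print(f"m_real={m_real:+.3f}  m_cd={m_cd:+.3f}")  # answer Yes/No from the corrected margin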

6) Don’t overinterpret “attention to background”

Given the literature on attention not being a faithful explanation, treat “attention moved to background” as a symptom (the model is forced to route attention somewhere) rather than a decision signal (“therefore it believes presence/absence”). (arXiv)


Where your current observations fit relative to known methods

Your captioning success resembles mechanisms described in decoding-focused hallucination work:

  • OPERA reports hallucinations correlate with attention patterns that over-trust a few summary tokens during decoding and proposes decoding-time penalties/rollback. This kind of mechanism is naturally aligned with long-form drift. (CVF Open Access)

Your POPE failure aligns with what POPE and VCD emphasize:

  • POPE frames hallucination as binary probing and highlights bias and co-occurrence effects in LVLMs. (arXiv)
  • VCD targets unimodal priors/statistical bias via contrastive distributions—precisely the regime where “looking more” is insufficient. (CVF Open Access)

Concrete next step

Run 50–200 POPE negatives and log three margins per example (m_text, m_real, m_null). Then answer two questions:

  1. How many failures are prior-dominated? (vision not affecting step-0)
  2. When you apply your steering during prefill, does m_real move?

If (1) is large and (2) is small, the right fix is not “more visual attention,” but step-0 logit calibration (VCD-like) or prefill-time intervention (recomputing step-0 under the intervention).
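
A sketch of that loop, assuming a hypothetical list `pope_negatives` of (PIL image, question) pairs you have already loaded; the helpers are the ones defined in the demo below, plus `bucket_failure` from the earlier sketch:

results = []
for image, question in pope_negatives:  # hypothetical: load these from the POPE annotation files
    p_mm = make_prompt(tok, question, with_image=True)
    p_txt = make_prompt(tok, question, with_image=False)
    logits_real, enc = step0_mm(model, processor, p_mm, image)
    logits_text = step0_text(model, processor, p_txt)
    logits_null = step0_distorted_avg(model, enc, k=1, sigma=2.0)  # or swap in a hard-null pixel_values
    m = {name: yesno_margin_step0(l, yes_ids, no_ids)
         for name, l in [("text", logits_text), ("real", logits_real), ("null", logits_null)]}
    wrong = m["real"] > 0  # on a negative ("no dog"), a positive margin means the model says "Yes"
    m["bucket"] = bucket_failure(m["text"], m["real"], m["null"]) if wrong else "correct"
    results.append(m)

prior_dominated = sum(r["bucket"] == "prior-dominated" for r in results)
print(f"{prior_dominated}/{len(results)} negatives are prior-dominated at step-0")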

# Single-file demo (no argparse): LLaVA step-0 Yes/No diagnostics + caption sanity
#
# Deps:
#   pip -q install -U "transformers>=4.40" "accelerate>=0.25" "bitsandbytes>=0.43" pillow requests
#
# Notes:
# - 4-bit NF4 loading via BitsAndBytesConfig.
# - For Intel/llava-gemma-2b, prompt formatting is via tokenizer.apply_chat_template.
# - LLaVA docs warn some checkpoints require setting processor.patch_size / num_additional_image_tokens /
#   vision_feature_select_strategy.
# - If you see token/feature mismatch, it can also be caused by truncation of expanded image placeholders;
#   this demo disables truncation and bumps model_max_length.

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig

# -------------------------
# Config (edit here)
# -------------------------
MODEL_ID = "Intel/llava-gemma-2b"  # primary demo model (you already downloaded it)

# If you want a non-Gemma fallback that is typically stable in HF LLaVA format:
# MODEL_ID = "xtuner/llava-phi-3-mini-hf"

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"
QUESTION_YN = "Is there a dog in this image? Answer Yes or No."
QUESTION_CAPTION = "Describe this image."

DISTORT_K = 3
DISTORT_SIGMA = 0.20
DISTORT_SEED = 0

GEN_MAX_NEW_TOKENS = 16
GEN_DO_SAMPLE = False

DTYPE = torch.float16

# Memory safety (T4): leave headroom for activations
MAX_MEMORY = {0: "13GiB", "cpu": "10GiB"}

# -------------------------
# Utilities
# -------------------------
def print_cuda_info():
    if torch.cuda.is_available():
        p = torch.cuda.get_device_properties(0)
        print(f"GPU: {p.name} | VRAM: {p.total_memory/(1024**3):.1f} GB")
        print(f"torch: {torch.__version__} | device: {torch.cuda.current_device()}")
    else:
        print("CUDA not available.")

def cuda_mem(tag=""):
    if not torch.cuda.is_available():
        return
    a = torch.cuda.memory_allocated() / (1024**3)
    r = torch.cuda.memory_reserved() / (1024**3)
    print(f"[mem]{' '+tag if tag else ''} allocated={a:.2f}GB reserved={r:.2f}GB")

def load_image(url: str) -> Image.Image:
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    return Image.open(r.raw).convert("RGB")

def make_prompt(tokenizer, question: str, with_image: bool) -> str:
    # Intel/llava-gemma-2b model card uses this exact pattern.
    if hasattr(tokenizer, "apply_chat_template") and getattr(tokenizer, "chat_template", None):
        content = f"<image>\n{question}" if with_image else question
        msgs = [{"role": "user", "content": content}]
        return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

    # Fallback (debug-only): plain LLaVA-style USER/ASSISTANT prompt.
    image_prefix = "<image>\n" if with_image else ""
    return f"USER: {image_prefix}{question}\nASSISTANT:"

def first_token_ids(tokenizer, strings):
    ids = []
    for s in strings:
        toks = tokenizer(s, add_special_tokens=False).input_ids
        if toks:
            ids.append(toks[0])
    return sorted(set(ids))

def logsumexp_ids(logits_1d, ids):
    x = logits_1d[ids]
    return torch.logsumexp(x, dim=0)

def yesno_margin_step0(logits_1d, yes_ids, no_ids) -> float:
    y = logsumexp_ids(logits_1d, yes_ids)
    n = logsumexp_ids(logits_1d, no_ids)
    return float((y - n).item())

def get_image_token_id(model, tokenizer):
    itok = getattr(model.config, "image_token_id", None) or getattr(model.config, "image_token_index", None)
    if itok is not None:
        return int(itok)
    # last-resort fallback
    tid = tokenizer.convert_tokens_to_ids("<image>")
    return int(tid) if tid is not None else None

def fix_processor_fields(processor, model):
    """
    Fix missing processor fields (patch_size=None etc.). This is recommended by LLaVA docs and
    is a known issue for LLaVA-Gemma processors.
    """
    # patch_size
    ps = getattr(processor, "patch_size", None)
    if ps is None:
        ps = getattr(getattr(model.config, "vision_config", None), "patch_size", None)
    if ps is None:
        ps = 14
    processor.patch_size = int(ps)

    # feature select strategy
    vsel = getattr(processor, "vision_feature_select_strategy", None)
    if vsel is None:
        vsel = getattr(model.config, "vision_feature_select_strategy", None)
    if vsel is None:
        vsel = "default"
    processor.vision_feature_select_strategy = str(vsel)

    # additional image tokens
    # IMPORTANT: many VLM encoders include a CLS token; if this is wrong, you get a +1 mismatch.
    nai = getattr(processor, "num_additional_image_tokens", None)
    if nai is None:
        nai = 1
    processor.num_additional_image_tokens = int(nai)

    print(
        f"[processor fix] patch_size={processor.patch_size}, "
        f"num_additional_image_tokens={processor.num_additional_image_tokens}, "
        f"vision_feature_select_strategy={processor.vision_feature_select_strategy}"
    )

@torch.inference_mode()
def vision_token_count(model, pixel_values: torch.Tensor) -> int:
    """
    Count vision tokens produced by the vision tower for these pixel_values.
    We use get_image_features (used internally by LLaVA models). :contentReference[oaicite:10]{index=10}
    """
    feats = model.get_image_features(pixel_values=pixel_values, return_dict=True).pooler_output
    # pooler_output can be list[Tensor] or Tensor; shapes can be (B,N,D) or (N,D) depending on model wiring.
    if isinstance(feats, (list, tuple)):
        f = feats[0]
    else:
        f = feats
    if f.ndim == 3:
        return int(f.shape[1])
    if f.ndim == 2:
        return int(f.shape[0])
    raise RuntimeError(f"Unexpected feature shape: {tuple(f.shape)}")

@torch.inference_mode()
def encode_mm_with_autoalign(model, processor, prompt: str, image: Image.Image, max_tries: int = 3):
    """
    Encode multimodal inputs and auto-fix the common off-by-one mismatch:
      placeholders (image tokens) != vision tokens
    This avoids the forward() crash "Image features and image tokens do not match".
    """
    tok = processor.tokenizer
    itok = get_image_token_id(model, tok)

    # Avoid truncation of expanded image placeholder tokens.
    if getattr(tok, "model_max_length", 0) and tok.model_max_length < 8192:
        tok.model_max_length = 8192

    for attempt in range(1, max_tries + 1):
        enc = processor(
            text=prompt,
            images=image,
            return_tensors="pt",
            truncation=False,
            padding=False,
        )
        enc = {k: v.to(model.device) for k, v in enc.items()}
        if "pixel_values" in enc:
            enc["pixel_values"] = enc["pixel_values"].to(DTYPE)

        if itok is None:
            return enc  # cannot count placeholders; proceed

        n_place = int((enc["input_ids"] == itok).sum().item())
        n_feat = vision_token_count(model, enc["pixel_values"])

        if n_place == n_feat:
            if attempt > 1:
                print(f"[autoalign] fixed after {attempt-1} adjustment(s): placeholders={n_place}, feat_tokens={n_feat}")
            return enc

        delta = n_feat - n_place
        print(f"[autoalign] mismatch attempt {attempt}: placeholders={n_place}, feat_tokens={n_feat}, delta={delta}")

        # adjust additional image tokens to close the gap (often delta==+1)
        processor.num_additional_image_tokens = int(getattr(processor, "num_additional_image_tokens", 0) + delta)

    raise ValueError("Could not auto-align image tokens/features after retries; check prompt (<image>) and processor settings.")

@torch.inference_mode()
def step0_mm(model, processor, prompt, image):
    enc = encode_mm_with_autoalign(model, processor, prompt, image)
    out = model(**enc, use_cache=False, return_dict=True)
    return out.logits[0, -1].float().cpu(), enc

@torch.inference_mode()
def step0_text(model, processor, prompt):
    enc = processor(text=prompt, return_tensors="pt", truncation=False, padding=False)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    out = model(**enc, use_cache=False, return_dict=True)
    return out.logits[0, -1].float().cpu()

@torch.inference_mode()
def step0_distorted_avg(model, enc_mm, k=3, sigma=0.2, seed=0):
    torch.manual_seed(seed)
    input_ids = enc_mm["input_ids"]
    attention_mask = enc_mm.get("attention_mask", None)
    pixel_values = enc_mm["pixel_values"]

    acc = None
    for _ in range(k):
        noise = torch.randn_like(pixel_values) * sigma
        pv = (pixel_values + noise).clamp(-3.0, 3.0)
        kwargs = {"input_ids": input_ids, "pixel_values": pv}
        if attention_mask is not None:
            kwargs["attention_mask"] = attention_mask
        out = model(**kwargs, use_cache=False, return_dict=True)
        logits = out.logits[0, -1].float()
        acc = logits if acc is None else (acc + logits)
    return (acc / k).cpu()

@torch.inference_mode()
def generate_mm(model, processor, prompt, image, max_new_tokens=16, do_sample=False):
    enc = encode_mm_with_autoalign(model, processor, prompt, image)
    gen = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=do_sample, use_cache=True)
    return processor.batch_decode(gen, skip_special_tokens=True)[0]

# -------------------------
# Main
# -------------------------
def main():
    torch.backends.cuda.matmul.allow_tf32 = True
    print_cuda_info()

    qconf = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=DTYPE,
        bnb_4bit_use_double_quant=True,
    )

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    tok = processor.tokenizer
    if tok is not None:
        tok.padding_side = "left"

    model = LlavaForConditionalGeneration.from_pretrained(
        MODEL_ID,
        device_map="auto",
        max_memory=MAX_MEMORY,
        torch_dtype="auto",
        quantization_config=qconf,
    )
    model.eval()
    cuda_mem("after load")

    fix_processor_fields(processor, model)

    image = load_image(IMAGE_URL)
    print(f"Image loaded: {IMAGE_URL} size={image.size}")

    # Step-0 Yes/No token IDs (first-token only)
    yes_ids = first_token_ids(tok, ["Yes", " yes"])
    no_ids = first_token_ids(tok, ["No", " no"])
    print(f"Yes first-token ids: {yes_ids}")
    print(f"No  first-token ids: {no_ids}")

    prompt_yn_mm = make_prompt(tok, QUESTION_YN, with_image=True)
    prompt_yn_txt = make_prompt(tok, QUESTION_YN, with_image=False)
    prompt_cap_mm = make_prompt(tok, QUESTION_CAPTION, with_image=True)

    print("\n=== YES/NO: step-0 (prefill) diagnostics ===")
    logits_real, enc_mm = step0_mm(model, processor, prompt_yn_mm, image)
    logits_text = step0_text(model, processor, prompt_yn_txt)
    logits_dist = step0_distorted_avg(model, enc_mm, k=DISTORT_K, sigma=DISTORT_SIGMA, seed=DISTORT_SEED)

    m_real = yesno_margin_step0(logits_real, yes_ids, no_ids)
    m_text = yesno_margin_step0(logits_text, yes_ids, no_ids)
    m_dist = yesno_margin_step0(logits_dist, yes_ids, no_ids)
    sens = abs(m_real - m_dist)

    print(f"m_real (MM)        = {m_real:+.3f}")
    print(f"m_text (text-only) = {m_text:+.3f}")
    print(f"m_dist (distorted) = {m_dist:+.3f}   (K={DISTORT_K}, sigma={DISTORT_SIGMA})")
    print(f"sensitivity |m_real - m_dist| = {sens:.3f}")

    print("\n[gen] Yes/No output (short):")
    print(generate_mm(model, processor, prompt_yn_mm, image, max_new_tokens=GEN_MAX_NEW_TOKENS, do_sample=GEN_DO_SAMPLE))

    cuda_mem("after yes/no")

    print("\n=== CAPTION: short generation ===")
    print(generate_mm(model, processor, prompt_cap_mm, image, max_new_tokens=GEN_MAX_NEW_TOKENS, do_sample=GEN_DO_SAMPLE))

    cuda_mem("done")

if __name__ == "__main__":
    main()