[Help Needed] Dual-Phase Softmax Steering on Llama-2 Residual Stream Yields Identical POPE Results

Hi everyone,

I am working on a hallucination mitigation technique for LLaVA-1.5-7b (based on Llama-2) called “Memory-Consistent Linear Control Barrier Functions (MC-LCBF)”. We are attempting to steer the model’s attention to fix two distinct failure modes: Generative Hallucination (drifting away from the image in long contexts) and Discriminative Bias (Step-0 “Yes/No” errors on the POPE benchmark).

The Core Issue

Despite implementing a theoretically sound “Dual-Phase” Control Barrier Function (CBF) using Softmax gradients and intervening directly on the Residual Stream, my results on the POPE benchmark are exactly identical to the baseline. It seems my steering vector θ* is either vanishing, getting normalized away, or is mathematically ineffective against the specific attention heads chosen.

I need a sanity check on my implementation logic regarding LlamaDecoderLayer vs. LlamaAttention and the efficacy of steering the Residual Stream.


1. The Theory: Dual-Phase Intervention & The Softmax Barrier

We address hallucination by splitting the generation process into two distinct phases, applying a Softmax Probability Mass Barrier h(x_t) to target the “Signal of Absence.”

Why Softmax? (Solving the “Signal of Absence”)
Previous methods used a raw Dot-Product Energy Barrier (Q · K^T). While effective for forcing the model to look at existing objects, it fails when an object is absent.

  • The Problem: If a prompt asks “Is there a dog?” and there is no dog, maximizing raw attention energy just forces the model to attend to random visual noise. The model, finding no “dog” features, falls back on its language prior: “questions about dogs are usually answered ‘Yes’.”
  • Our Solution (The Push-Pull Mechanism): We replace raw energy with a Softmax Probability Barrier.
h(x_t) = \sum_{i \in \text{Image}} \text{Softmax}(A)_i \ge \tau

Because Softmax is zero-sum, increasing the probability mass on image tokens mathematically requires decreasing the probability mass on text tokens.

\nabla P_{\text{img}} \propto (1 - P_{\text{img}}) \cdot K_{\text{img}} - P_{\text{img}} \cdot K_{\text{text}}

This gradient creates a Push-Pull Force: it pushes the Query Q towards visual features while actively repelling it from the text tokens (e.g., the word “dog” in the prompt). This suppresses the unimodal language priors that cause hallucinations when visual evidence is missing.
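To make the zero-sum effect concrete, here is a minimal toy sketch (the tensor names `q`, `keys`, `n_img`, `n_txt` are made up for illustration and are not part of the actual implementation) showing that following the gradient of the image probability mass simultaneously drains mass from the text tokens:

```python
import torch

# Toy check of the push-pull property: d/dq of sum_i softmax(K q / sqrt(d))_i over "image" rows.
torch.manual_seed(0)
d, n_img, n_txt = 16, 4, 4
q = torch.randn(d, requires_grad=True)
keys = torch.randn(n_img + n_txt, d)            # first n_img rows play the role of image keys
probs = torch.softmax(keys @ q / d**0.5, dim=-1)
h = probs[:n_img].sum()                         # barrier h(x): image probability mass
h.backward()

# Moving q along the gradient raises image mass and, because softmax is zero-sum,
# lowers text mass by exactly the same amount.
with torch.no_grad():
    q2 = q + 0.5 * q.grad
    probs2 = torch.softmax(keys @ q2 / d**0.5, dim=-1)
print(f"image mass: {h.item():.3f} -> {probs2[:n_img].sum().item():.3f}")
print(f"text  mass: {probs[n_img:].sum().item():.3f} -> {probs2[n_img:].sum().item():.3f}")
```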

Phase 1: Prefill Intervention (The “Step-0” Problem)
On benchmarks like POPE (Yes/No QA), the model effectively commits to its answer logit (“Yes” or “No”) immediately after the prefill pass over the prompt (seq_len > 1). Standard auto-regressive steering is too late; the decision is already “locked in” by the prefill computation.

  • Mechanism: We detect the prefill phase and intervene specifically on the Last Token of the Prompt. We calculate the Softmax distribution of this specific token against the image and inject θ* to suppress text-based biases before the model commits to a decision.

Phase 2: Decoding Intervention (Generative Drift)
Over long captions (CHAIR benchmark), attention “drifts” away from visual grounding. We intervene on every generated token (seq_len == 1) to maintain the Softmax probability mass on the image above τ, ensuring consistent grounding.


2. The Control Law: Taylor Expansion Linearization

To enforce the barrier, we treat the Transformer layer as a dynamical system and apply input-constrained Optimal Control.

We define the safe set as C = {x : h(x) ≥ τ}. To find the minimal intervention θ* that projects the state back into C, we linearize the highly non-linear Softmax manifold using a First-Order Taylor Expansion around the current state x_t:

h(x_t + \theta) \approx h(x_t) + \nabla_{x_t} h(x_t)^\top \theta

We then solve the QP (Quadratic Program):

\min_{\theta} \frac{1}{2} ||\theta||^2 \quad \text{s.t.} \quad h(x_t) + \nabla h(x_t)^\top \theta \ge \tau

This yields the closed-form solution:

\theta^* = \alpha \cdot \frac{\tau - h(x_t)}{||\nabla_{x_t} h(x_t)||^2} \nabla_{x_t} h(x_t)

This vector θ* represents the direction of steepest ascent on the probability manifold—the most efficient way to shift attention mass from text to image.
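For reference, a minimal sketch of this closed-form update, assuming h(x_t) and its gradient have already been computed for the target token (the names `cbf_correction`, `h_val`, `grad` are illustrative):

```python
import torch

def cbf_correction(h_val: torch.Tensor, grad: torch.Tensor,
                   tau: float, alpha: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Closed-form solution of min ||theta||^2  s.t.  h + grad^T theta >= tau.

    Returns zero when the barrier is already satisfied (h >= tau); otherwise the
    minimum-norm correction along grad, scaled by alpha.
    """
    slack = tau - h_val
    if slack <= 0:
        return torch.zeros_like(grad)
    return alpha * slack / (grad.pow(2).sum() + eps) * grad
```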


3. Code Snippet (Implementation Details)

Here is exactly how I implemented the gradient flow to ensure mathematically correct steering on the un-normalized residual stream:

  1. Intercept hidden_states (Residual Stream): I catch hidden_states at the very start of LlamaDecoderLayer.forward.

  2. Manual Normalization: To calculate what the Attention layer would see, I manually apply self.input_layernorm(hidden_states).

  3. Forward Projection: I project this normalized state to Q using the layer’s weights (self.self_attn.q_proj) and apply RoPE.

  4. Backward Pass: I run torch.autograd.grad from the Softmax output back to the un-normalized hidden_states.

  5. Update: I add the resulting θ* in-place to the residual hidden_states.

"""

Experiment 6: Dual-Phase MC-LCBF with Softmax Probability Barrier

=================================================================




Methodology:

1.  **Phase 1 (Prefill):** Intervene on the LAST token of the prompt (the question).

    -   Target: Step-0 "Yes/No" decision.

    -   Barrier: Softmax Probability Mass on Image Tokens.

    -   Mechanism: Push (Image) / Pull (Text) via Softmax Gradient.




2.  **Phase 2 (Decode):** Intervene on every generated token.

    -   Target: Generative Drift (CHAIR).

    -   Barrier: Attention Energy (or Softmax) to maintain grounding.

    -   Mechanism: Keep attention on image to prevent hallucination.




Theoretical Justification:

-   **Linearization:** We use a local first-order Taylor expansion of the complex

    softmax/layernorm manifold to derive a closed-form linear correction.

-   **Push-Pull:** The gradient of the softmax barrier naturally suppressed text

    priors when boosting image attention, acting as a "negative constraint" on

    hallucination.

-   **Memory Consistency:** We intervene on the Residual Stream (before Norm) 

    to ensure the correction $\theta^*$ permanently updates the causal memory.




Implementation Details:

-   Hooks into `LlamaDecoderLayer` forward pass (Architecture Fix).

-   Uses `torch.enable_grad()` locally to compute $\nabla_x h(x)$.

-   Differentiates through `LayerNorm` and `Softmax` for accurate gradients.

-   Applies $\theta^*$ correction in-place to the raw residual hidden states.

"""




import os
import sys
import math
import types
import json
import torch
import torch.nn.functional as F
import argparse
import numpy as np
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, repeat_kv

# --- Setup Paths ---
PROJECT_DIR = os.getcwd()
sys.path.append(PROJECT_DIR)

POPE_DATA_DIR = os.path.join(PROJECT_DIR, 'data/pope/coco')
POPE_IMAGE_DIR = os.path.join(PROJECT_DIR, 'data/mscoco/val2014')
RESULTS_DIR = os.path.join(PROJECT_DIR, 'results_exp6_dual_phase')
os.makedirs(RESULTS_DIR, exist_ok=True)

# --- Configuration ---
MODEL_ID = "llava-hf/llava-1.5-7b-hf"
CACHE_DIR = os.path.join(PROJECT_DIR, 'model_cache')

STEERING_CONFIG = {
    "is_active": True,
    # --- Prefill Barrier (Softmax Probability) ---
    "prefill_tau": 0.2,
    "prefill_alpha": 1.0,    # Lowered for Residual Stream stability
    # --- Decode Barrier (Energy or Softmax) ---
    "decode_tau": 0.2,
    "decode_alpha": 0.5,     # Lowered for Residual Stream stability
    # --- Shared ---
    "steer_layers": list(range(10, 28)),
    "img_start": 0,
    "img_end": 0,
}

# =========================================================
# The Dual-Phase Decoder Layer Hook
# =========================================================

def dual_phase_decoder_wrapper(original_forward, layer_idx):
    def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs):
        # NOTE: self here is the LlamaDecoderLayer, NOT LlamaAttention
        bsz, seq_len, hidden_size = hidden_states.size()

        # Check if we should steer this layer
        if STEERING_CONFIG["is_active"] and layer_idx in STEERING_CONFIG["steer_layers"]:
            # --- Logic: Identify Target Token(s) ---
            target_slice = slice(None)
            do_steer = False
            if seq_len > 1:  # Prefill
                # Target only the last token (Step-0 Decision token)
                target_slice = slice(-1, None)
                tau = STEERING_CONFIG["prefill_tau"]
                alpha = STEERING_CONFIG["prefill_alpha"]
                do_steer = True
            else:  # Decode
                # Target the single token being generated
                target_slice = slice(None)
                tau = STEERING_CONFIG["decode_tau"]
                alpha = STEERING_CONFIG["decode_alpha"]
                do_steer = True

            # --- A* PIVOT: Softmax Barrier on Residual Stream ---
            if do_steer:
                with torch.enable_grad():
                    # 1. Isolate the Target Residual State x_t
                    # We detach to start a fresh graph for the gradient calculation
                    x_full_resid = hidden_states
                    x_t_resid = x_full_resid[:, target_slice, :].clone().detach().requires_grad_(True)
                    # 2. Normalize x_t (Crucial Fix: Simulate the Layer's behavior)
                    # We must pass gradients through this norm to align strictly with the math.
                    x_t_norm = self.input_layernorm(x_t_resid)
                    # 3. Re-construct Q from x_t_norm
                    num_heads = self.self_attn.config.num_attention_heads
                    num_kv_heads = getattr(self.self_attn.config, "num_key_value_heads", num_heads)
                    head_dim = hidden_size // num_heads
                    q_probe = self.self_attn.q_proj(x_t_norm).view(bsz, x_t_resid.shape[1], num_heads, head_dim).transpose(1, 2)
                    # Apply RoPE
                    if position_embeddings is not None:
                        cos, sin = position_embeddings
                        # Handle possible shapes: [B/1, S, H, D], [B/1, S, D], or [S, D]
                        # For generated tokens or slicing, we need to carefully select the right pos embeddings
                        if cos.dim() == 4:
                            cos_slice = cos[:, target_slice, :, :]
                            sin_slice = sin[:, target_slice, :, :]
                        elif cos.dim() == 3:
                            cos_slice = cos[:, target_slice, :]
                            sin_slice = sin[:, target_slice, :]
                        elif cos.dim() == 2:
                            cos_slice = cos[target_slice, :]
                            sin_slice = sin[target_slice, :]
                        else:
                            cos_slice, sin_slice = cos, sin
                        q_probe, _ = apply_rotary_pos_emb(q_probe, q_probe, cos_slice, sin_slice)

                    # 4. Construct K (Key Cache + Current)
                    img_start, img_end = STEERING_CONFIG["img_start"], STEERING_CONFIG["img_end"]
                    all_keys = None

                    if seq_len > 1:  # Prefill
                        # Fix: Must normalize x_full_resid before projecting Keys!
                        with torch.no_grad():
                            # Normalize the full sequence for Key generation
                            x_full_norm = self.input_layernorm(x_full_resid.detach())
                            k_full = self.self_attn.k_proj(x_full_norm).view(bsz, seq_len, num_kv_heads, head_dim).transpose(1, 2)
                            if position_embeddings is not None:
                                k_full, _ = apply_rotary_pos_emb(k_full, k_full, cos, sin)
                        all_keys = k_full
                    else:  # Decode
                        # Use past_key_values from the cache.
                        # Note: in `LlamaDecoderLayer`, `past_key_value` might be passed as an argument,
                        # but `LlavaForConditionalGeneration` usually manages a global `past_key_values` cache object.
                        # The `past_key_values` arg in forward is usually the full tuple.
                        # We need to access the cache for THIS layer.
                        # In the HF implementation, `past_key_value` (singular) is often passed for the specific layer,
                        # or we access `past_key_values[layer_idx]`.
                        # Let's try to robustly find the cache.
                        # The signature has `past_key_value` (singular), which is legacy, and `past_key_values` (plural).
                        current_keys = None
                        # Check `past_key_value` (Layer-specific cache in some HF versions)
                        if past_key_value is not None:
                            current_keys = past_key_value[0]  # [B, n_heads, seq, dim]
                        # Check `past_key_values` (Global cache tuple)
                        elif kwargs.get('past_key_values') is not None and len(kwargs.get('past_key_values')) > layer_idx:
                            current_keys = kwargs.get('past_key_values')[layer_idx][0]
                        if current_keys is not None:
                            all_keys = current_keys

                    # 5. Compute Softmax Barrier
                    if all_keys is not None:
                        if num_heads != num_kv_heads:
                            all_keys = repeat_kv(all_keys, num_heads // num_kv_heads)
                        # Q * K^T
                        attn_logits = torch.matmul(q_probe, all_keys.transpose(-1, -2)) / math.sqrt(head_dim)
                        # Softmax
                        attn_probs = F.softmax(attn_logits, dim=-1)
                        # Barrier Function h(x): Sum of probabilities on Image Tokens
                        valid_end = min(img_end, attn_probs.shape[-1])
                        valid_start = min(img_start, valid_end)
                        if valid_end > valid_start:
                            image_mass = attn_probs[:, :, :, valid_start:valid_end].sum(dim=-1)
                            h_val = image_mass.mean()
                            if h_val < tau:
                                # Gradient w.r.t. x_t_resid (The Residual State)
                                grads = torch.autograd.grad(h_val, x_t_resid, retain_graph=False)[0]
                                grad_norm_sq = torch.sum(grads * grads)
                                # Scale Calculation
                                scale = (tau - h_val) / (grad_norm_sq + 1e-6)
                                # Clamp (Critical for Residual Stream Stability)
                                if scale > 0.5: scale = 0.5
                                theta = scale * grads
                                # Apply Correction to Residual Stream
                                hidden_states[:, target_slice, :] += alpha * theta.detach()

        # Execute Original Forward (now with modified hidden_states)
        return original_forward(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
            position_embeddings=position_embeddings,
            **kwargs
        )

    return forward

# =========================================================
# Generator Class
# =========================================================

class DualPhase_LCBF_Generator:
    def __init__(self, model, processor):
        self.model = model
        self.processor = processor
        self.image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")

        print(f"\nInitializing Dual-Phase MC-LCBF Controller (Residual Stream Wrapped)...")
        print(f"  Prefill Target Mass: {STEERING_CONFIG['prefill_tau']}")
        print(f"  Decode Target Mass:  {STEERING_CONFIG['decode_tau']}")
        print(f"Injecting hooks into Layers {min(STEERING_CONFIG['steer_layers'])}-{max(STEERING_CONFIG['steer_layers'])}...")
        # FIX: Wrap DecoderLayer, not Attention
        layers = model.language_model.layers if hasattr(model, "language_model") else model.model.layers
        for i, layer in enumerate(layers):
            # We wrap layer.forward, not layer.self_attn.forward
            layer.forward = types.MethodType(
                dual_phase_decoder_wrapper(layer.forward, i),
                layer,
            )

    def generate(self, image, prompt, steering=True):
        STEERING_CONFIG["is_active"] = steering
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        # Estimate image span (LLaVA-1.5 specific)
        STEERING_CONFIG["img_start"] = 1
        STEERING_CONFIG["img_end"] = 577
        input_ids = inputs.input_ids.to(self.model.device)
        pixel_values = inputs.pixel_values.to(self.model.device, dtype=torch.float16)

        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                pixel_values=pixel_values,
                max_new_tokens=5,
                do_sample=False,
                use_cache=True
            )
        return self.processor.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

# =========================================================
# Main Execution
# =========================================================

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--type", type=str, default="random", choices=['random', 'popular', 'adversarial'])
    parser.add_argument("--num_images", type=int, default=None, help="Number of images to process")
    args = parser.parse_args()

    # Init
    print(f"Loading {MODEL_ID}...")
    model = LlavaForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        attn_implementation="eager",
    ).to("cuda")
    processor = AutoProcessor.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
    gen = DualPhase_LCBF_Generator(model, processor)

    # Load POPE Data
    question_file = os.path.join(POPE_DATA_DIR, f"coco_pope_{args.type}.json")
    out_file = os.path.join(RESULTS_DIR, f"exp6_pope_{args.type}_dual_phase.jsonl")
    questions = []
    with open(question_file, 'r') as f:
        for line in f:
            questions.append(json.loads(line))

    if args.num_images is not None:
        questions = questions[:args.num_images]
    print(f"Evaluating {len(questions)} samples on {args.type}...")

    results = []
    for item in tqdm(questions):
        try:
            image_path = os.path.join(POPE_IMAGE_DIR, item['image'])
            if not os.path.exists(image_path): continue
            image = Image.open(image_path).convert("RGB")
            prompt = f"USER: <image>\n{item['text']}\nASSISTANT:"

            # Baseline (No Steering)
            ans_base = gen.generate(image, prompt, steering=False).lower().strip()
            pred_base = "yes" if "yes" in ans_base else ("no" if "no" in ans_base else "no")

            # Steered (Dual Phase)
            ans_steer = gen.generate(image, prompt, steering=True).lower().strip()
            pred_steer = "yes" if "yes" in ans_steer else ("no" if "no" in ans_steer else "no")

            res = {
                'question': item['text'],
                'label': item['label'],
                'baseline_pred': pred_base,
                'steered_pred': pred_steer
            }
            results.append(res)
            with open(out_file, 'a') as f:
                f.write(json.dumps(res) + "\n")
        except Exception as e:
            print(f"Error: {e}")

    # Metrics
    acc_base = sum([1 for x in results if x['label'] == x['baseline_pred']]) / len(results)
    acc_steer = sum([1 for x in results if x['label'] == x['steered_pred']]) / len(results)
    print(f"\nFinal Results Exp 6 ({args.type}):")
    print(f"Baseline Accuracy: {acc_base:.4f}")
    print(f"Steered Accuracy:  {acc_steer:.4f}")
```

4. Why is POPE Not Improving?

This is my main concern. Even with alpha=1.0 (or 4.0) and tau=0.5 (forcing 50% attention on the image), the final generated “Yes/No” tokens on POPE are identical to the baseline.

My Questions for the Community:

  1. RMSNorm Gradient Vanishing: Does backpropagating through LlamaRMSNorm dampen the gradient so much that the update to the residual stream is negligible? Or does the non-linearity of RMSNorm distort the assumption that grad(norm(x)) points in a useful direction for x?

  2. KV Cache Mismatch: Since LlamaDecoderLayer modifies the KV cache in-place during the forward pass, does my “look-ahead” calculation of K (derived from a detached x_full which I manually project) create a mismatch with the actual past_key_values used by the model?

  3. Residual Dampening: Is adding a vector to the residual stream (dim 4096) ineffective because the subsequent MLP block or next layer’s Norm immediately crushes this perturbation?

Any insights on steering the Residual Stream vs. Post-Norm states in Llama-2 would be incredibly helpful. Are my gradients flowing correctly through that manual Norm step?

Thanks!


The following is within the scope of my experiments with transformers < 5; there are still many unknowns:


Why you can get identical POPE outputs even when the Softmax-mass math is “right”

There are two broad possibilities, and your description fits both:

  1. The intervention is not actually affecting the model’s forward pass used to produce the POPE answer (a silent no-op).
  2. The intervention changes your chosen proxy metric (image attention mass), but that proxy is not causally aligned with the step-0 Yes/No decision in POPE (a real effect, but not on the decision boundary).

The fastest way to separate them is to add one “unit test”:

  • Run a single prefill forward (no generate, no cache) and compare step-0 scores for " yes" vs " no" with and without the intervention. If those logits are identical to within numerical noise across many samples, you likely have a no-op. If they move but predictions don’t change, it’s likely an objective mismatch or insufficient targeting.
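A minimal sketch of that unit test, reusing the script’s STEERING_CONFIG toggle (the name `step0_margin` and the "Yes"/"No" token handling are assumptions; check how your tokenizer actually splits the answer strings):

```python
import torch

def step0_margin(model, processor, image, prompt, steering: bool) -> float:
    """Yes/No logit margin at step 0 (last prompt position), no generate, no cache."""
    STEERING_CONFIG["is_active"] = steering
    inputs = processor(text=prompt, images=image,
                       return_tensors="pt").to(model.device, torch.float16)
    with torch.no_grad():
        logits = model(**inputs, use_cache=False).logits          # [1, seq, vocab]
    tok = processor.tokenizer
    yes_id = tok.encode("Yes", add_special_tokens=False)[0]       # verify these ids once
    no_id = tok.encode("No", add_special_tokens=False)[0]
    last = logits[0, -1]
    return (last[yes_id] - last[no_id]).item()

# If the margin is bit-identical with steering on and off across many samples,
# the intervention is a silent no-op.
```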

Implementation-level no-ops that produce “identical outputs”

1) Barrier computed on attention that does not match the model’s attention

Your manual attention reconstruction must match three details from the reference Llama attention path:

  • Masking: the model adds the attention mask before Softmax.
  • Softmax dtype: the model upcasts Softmax to float32 and casts back.
  • RoPE / positions: queries and keys must use the same positions/cos/sin slices as the actual forward.

Hugging Face’s Llama attention explicitly does:

  • add mask before Softmax, then
  • Softmax in float32, then cast back. (GitHub)

If your probe uses F.softmax(attn_logits, dim=-1) in fp16/bf16 and/or skips adding the model’s mask, your computed h and ∇h can point in a direction that doesn’t correspond to the model’s true computation. In that case you can “satisfy” your local constraint without producing the intended change in the real forward.

Practical symptom: you can print a large theta norm or a large change in your locally computed h, but step-0 logits don’t move.
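As a sketch of what “matching the reference path” means for the probe, assuming the decoder layer receives the usual 4D additive causal mask (verify the shape against your transformers version; `masked_fp32_softmax` and `additive_mask` are illustrative names):

```python
from typing import Optional
import torch
import torch.nn.functional as F

def masked_fp32_softmax(attn_logits: torch.Tensor,
                        additive_mask: Optional[torch.Tensor],
                        out_dtype: torch.dtype) -> torch.Tensor:
    """Mirror the eager Llama attention path: add the additive (causal) mask
    before the softmax, run the softmax in float32, then cast back.

    `additive_mask` is assumed to be the 4D mask the decoder layer receives,
    sliced to the probed query rows and the current key length, e.g.
    attention_mask[:, :, target_slice, :all_keys.shape[-2]] in the wrapper above.
    """
    if additive_mask is not None:
        attn_logits = attn_logits + additive_mask
    return F.softmax(attn_logits, dim=-1, dtype=torch.float32).to(out_dtype)
```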


2) Hard-coded image span (or span computed on the wrong IDs)

LLaVA’s config defines image_token_index (default 32000) and the placeholder expands into a block of repeated image tokens. (Hugging Face)
But the start/end positions in input_ids can shift with templating, system prompts, and tokenization details. If you hardcode img_start/img_end, many examples will “steer” the wrong region (text tokens or padding), producing no meaningful effect on visual grounding.

Rule: compute span from input_ids == image_token_index every sample, then use the contiguous block.
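A per-sample version of that rule might look like the sketch below (`image_token_span` is a hypothetical helper; note that depending on the processor version the placeholder may already be expanded to one token per patch in input_ids, or appear as a single token that the model expands internally, so verify on one example):

```python
import torch

def image_token_span(input_ids: torch.Tensor, image_token_index: int) -> tuple[int, int]:
    """Locate the contiguous block of image placeholder tokens for one sample.

    `image_token_index` should come from model.config.image_token_index rather
    than being hardcoded. Returns (start, end) with end exclusive.
    """
    positions = (input_ids[0] == image_token_index).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        raise ValueError("No image tokens found in input_ids")
    return int(positions[0]), int(positions[-1]) + 1

# Usage sketch (per sample, before generate):
# STEERING_CONFIG["img_start"], STEERING_CONFIG["img_end"] = image_token_span(
#     inputs.input_ids, model.config.image_token_index)
```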


3) Prefill key reconstruction uses a detached full sequence, then you modify the sequence

In your prefill branch, you compute:

  • k_full from x_full_resid.detach() under no_grad, then
  • compute ∇h w.r.t. x_t_resid (query-side), then
  • add theta into hidden_states.

This creates a consistency gap: the gradient is computed using keys derived from a different state than the state you ultimately pass into the real forward. Even if you only differentiate through queries, the “best” query move depends on the actual keys, and your keys are stale.

If you want query-only gradients, you still want keys computed from the same forward-state that will be used for attention.


4) Cache pathway ambiguity (decode phase especially)

Transformers has moved to explicit cache objects (Cache, DynamicCache, etc.) and documents multiple cache strategies. (Hugging Face)
If your wrapper passes both legacy and new cache arguments (or reads keys from a different representation than the attention module is using), decode-time steering can silently become a no-op or use the wrong tensors.

Two concrete failure modes:

  • You compute all_keys from past_key_value / kwargs['past_key_values'] incorrectly for the current Transformers version.
  • You call the original forward with conflicting cache args, so the model uses a cache path that your steering logic didn’t anticipate.

Even if prefill works, decode can quietly fail this way. The cache docs emphasize that the cache is integral to how keys/values are reused and updated. (Hugging Face)


Answers to your three questions

1) Does RMSNorm make the gradient useless?

Usually, no—but it changes what kinds of perturbations matter.

Key points:

  • RMSNorm is scale-normalizing (rescaling invariance). The original RMSNorm paper describes this invariance explicitly. (arXiv)
  • Hugging Face’s Llama RMSNorm implementation (and PyTorch’s RMSNorm) typically does computation in float32 then casts back, which helps numerical stability rather than killing gradients. (Hugging Face)

What that means for residual-stream steering:

  • Any perturbation that mostly changes the magnitude of hidden_states will be largely washed out by RMSNorm.
  • Perturbations that change the direction of the hidden state vector are what survive normalization.

So the more relevant diagnostic is not “is the gradient small”, but:

  • Is your theta mostly parallel to x (gets normalized away), or does it change direction?

A simple check: compute cosine similarity between theta and the original last-token hidden state. If it is very close to 1 or -1, normalization will reduce the effective change.
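A minimal sketch of that check (the name `correction_alignment` is illustrative; `x_t` is the last-token residual state, `theta` the computed correction):

```python
import torch
import torch.nn.functional as F

def correction_alignment(x_t: torch.Tensor, theta: torch.Tensor) -> float:
    """Cosine similarity between the steering correction and the state it is added to.

    |cos| close to 1 means theta mostly rescales x_t, which RMSNorm largely undoes;
    the component of theta orthogonal to x_t is what changes the post-norm direction.
    """
    return F.cosine_similarity(theta.flatten(), x_t.flatten(), dim=0).item()
```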

Bottom line: RMSNorm is unlikely to cause literal vanishing gradients by itself; it mainly makes “directional” control matter more than “amplitude” control.


2) Can KV cache mismatch make your look-ahead K wrong?

Yes—this is one of the most common sources of silent failure.

  • The cache is where the model stores attention-layer derived key/value states for previously processed tokens. (Hugging Face)
  • Transformers supports multiple cache classes and behaviors (DynamicCache, StaticCache, etc.). (Hugging Face)

In decode (seq_len == 1), the keys you need are the concatenation of:

  • past cached keys, plus
  • the current token’s key (after projection + RoPE) appended at the right position.

If you reconstruct all_keys outside the attention module and don’t exactly match:

  • cache format,
  • head expansion (repeat_kv / GQA),
  • position indexing (cache_position),
    you can compute a barrier that is disconnected from what the model actually uses.

Most robust fix: compute your barrier inside (or immediately around) the attention module, where key_states and the cache update are already correct.
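One way to do that purely as a diagnostic, assuming eager attention and a forward pass run with output_attentions=True so the attention modules return their weights (the exact output tuple layout varies across transformers versions; `make_attn_probe` and `captured` are illustrative names):

```python
import torch

captured = {}  # layer_idx -> per-head image attention mass at the last query position

def make_attn_probe(layer_idx: int, img_start: int, img_end: int):
    def hook(module, args, output):
        # With eager attention and output_attentions=True, the attention module
        # returns (attn_output, attn_weights, ...); attn_weights is [B, n_heads, q_len, kv_len].
        attn_weights = output[1]
        if attn_weights is not None:
            captured[layer_idx] = attn_weights[0, :, -1, img_start:img_end].sum(dim=-1).detach()
    return hook

# Usage sketch: register on each attention module, run one prefill forward with
# output_attentions=True, and compare `captured` against the image mass the
# wrapper computed from its reconstructed Q/K.
# for i, layer in enumerate(layers):
#     layer.self_attn.register_forward_hook(make_attn_probe(i, img_start, img_end))
```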


3) Is residual-stream injection ineffective because later blocks crush it?

It can be, but “ineffective” usually means “insufficiently targeted,” not “always crushed.”

Reasons a perturbation may not survive to logits:

  • You inject at one layer, but the model’s decision is dominated by later layers.
  • The MLP path and later attention layers can re-route information, especially if only a subset of heads are cross-modal critical.
  • If you average h across all heads, you can raise the mean image mass while leaving the few decisive heads unchanged.

So the architecture doesn’t guarantee that “more image attention mass” in one layer maps to a different Yes/No decision.

Actionable improvement: head- and layer-selective constraints (see below).


Why POPE may not improve even if your image-mass barrier is enforced

Even if the barrier truly increases image attention mass, POPE’s step-0 Yes/No is a logit-margin decision at the LM head. Attention mass is only an indirect proxy. Two common cases:

  1. The model already attends to the image enough to extract the needed signal, but still answers “yes” due to a learned prior or dataset bias.
  2. Your constraint increases attention to the image tokens, but not to the right visual features (or not in the heads/layers that carry the classification signal).

This is especially plausible because the model’s attention distribution can shift without materially changing the final hidden state projection onto the “yes/no” directions.


A practical checklist that tends to make this work (or conclusively show why it doesn’t)

A) Make the barrier computation match the reference implementation

  • Add the same attention mask before Softmax, and compute Softmax in float32 (then cast back). (GitHub)
  • Ensure RoPE uses the correct positions for the probed token(s).

B) Verify the intervention changes something upstream of the LM head

For prefill (POPE step-0):

  • Run a forward with use_cache=False and measure the yes/no logit margin. Do it on the first differing token if your tokenizer splits " yes" / " no" (this happens depending on tokenizer settings and string form).
  • Confirm that the hidden state at the final layer last position changes (e.g., L2 diff).
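A sketch for the second bullet, comparing the final-layer, last-position hidden state with steering on and off (`last_hidden_delta` is an illustrative name; it reuses the script’s STEERING_CONFIG toggle):

```python
import torch

def last_hidden_delta(model, processor, image, prompt) -> float:
    """L2 difference of the final-layer, last-position hidden state, steering on vs. off."""
    def run(steering: bool) -> torch.Tensor:
        STEERING_CONFIG["is_active"] = steering
        inputs = processor(text=prompt, images=image,
                           return_tensors="pt").to(model.device, torch.float16)
        with torch.no_grad():
            out = model(**inputs, use_cache=False, output_hidden_states=True)
        return out.hidden_states[-1][0, -1].float()
    return (run(True) - run(False)).norm().item()

# A delta of exactly 0.0 across samples means the wrapper never touched the
# forward pass that produces the POPE answer.
```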

C) Compute image span per sample

Use image_token_index from the model config and locate tokens in input_ids. (Hugging Face)

D) Fix cache semantics and avoid mixed cache arguments

Align with Transformers cache docs and use one cache pathway consistently. (Hugging Face)

E) Stop averaging away the signal

Instead of mean over all heads:

  • compute per-head image mass
  • target top-k heads (or enforce a minimum across a selected set)
    This often produces a larger change in representations than moving the mean slightly.
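For example, a sketch of a head-selective barrier (`per_head_barrier` and the top-k selection rule are one possible choice under the stated idea, not the only one):

```python
import torch

def per_head_barrier(attn_probs: torch.Tensor, img_start: int, img_end: int,
                     top_k: int = 8) -> torch.Tensor:
    """Barrier over selected heads instead of the all-head mean.

    attn_probs: [B, n_heads, q_len, kv_len] softmax probabilities for the probed token(s).
    Uses the mean image mass of the top-k heads (ranked by image mass), so the gradient
    concentrates on heads that plausibly carry the visual signal; enforcing a minimum
    over a hand-picked head set is another option.
    """
    image_mass = attn_probs[..., img_start:img_end].sum(dim=-1)   # [B, n_heads, q_len]
    per_head = image_mass.mean(dim=(0, 2))                        # [n_heads]
    top_vals, _ = per_head.topk(min(top_k, per_head.numel()))
    return top_vals.mean()  # compare against tau and backprop exactly as before
```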

F) Add a POPE-aligned constraint

If the goal is Yes/No accuracy, add a small controller that directly targets the logit margin (or sequence logprob margin) at step-0, and use image-mass only as an auxiliary constraint. Attention-mass alone is often too far from the decision variable.


Background: what the reference stack is doing that your probe must match

  • LLaVA config defines where image tokens live via image_token_index. (Hugging Face)
  • Llama attention adds mask before Softmax and uses float32 Softmax. (GitHub)
  • RMSNorm is scale-invariant and commonly upcasts internally for stability. (arXiv)
  • Generation cache is now an explicit set of cache classes; behavior differs by cache type. (Hugging Face)

If any one of these mismatches your reconstructed attention, your computed gradient can be “correct” for the probe but irrelevant for the real forward.


What your situation most likely is (based on typical failure patterns)

If your POPE outputs are exactly identical, the highest-probability causes are:

  1. Silent no-op due to mismatch between your probed attention and the model’s real attention (mask + fp32 softmax + RoPE + cache_position). (GitHub)
  2. Incorrect image span on at least a large subset of samples (hardcoded indices). (Hugging Face)
  3. Cache pathway mismatch in decode or mixed cache argument passing. (Hugging Face)

Only after you confirm step-0 logits move under use_cache=False should you conclude that “attention-mass steering is mathematically ineffective.” In practice, it’s often effective at moving the proxy metric, but POPE needs either head/layer targeting or a margin-based constraint to change predictions.