Hi everyone,
I am working on a hallucination mitigation technique for LLaVA-1.5-7b (based on Llama-2) called "Memory-Consistent Linear Control Barrier Functions (MC-LCBF)". We are attempting to steer the model's attention to fix two distinct failure modes: Generative Hallucination (drifting away from the image in long contexts) and Discriminative Bias (Step-0 "Yes/No" errors on the POPE benchmark).
The Core Issue
Despite implementing a theoretically sound "Dual-Phase" Control Barrier Function (CBF) using Softmax gradients and intervening directly on the Residual Stream, my results on the POPE benchmark are exactly identical to the baseline. It seems my steering vector θ* is either vanishing, getting normalized away, or is mathematically ineffective against the specific attention heads chosen.
I need a sanity check on my implementation logic regarding LlamaDecoderLayer vs. LlamaAttention and the efficacy of steering the Residual Stream.
1. The Theory: Dual-Phase Intervention & The Softmax Barrier
We address hallucination by splitting the generation process into two distinct phases, applying a Softmax Probability Mass Barrier h(x_t) to target the "Signal of Absence."
Why Softmax? (Solving the "Signal of Absence")
Previous methods used a raw Dot-Product Energy Barrier (Q · K^T). While effective for forcing the model to look at existing objects, it fails when an object is absent.
- The Problem: If a prompt asks "Is there a dog?" and there is no dog, maximizing raw attention energy just forces the model to attend to random visual noise. The model, finding no "dog" features, falls back on its language prior: "questions about dogs are usually answered 'Yes'."
- Our Solution (The Push-Pull Mechanism): We replace raw energy with a Softmax Probability Barrier.
Because Softmax is zero-sum, increasing the probability mass on image tokens mathematically requires decreasing the probability mass on text tokens.
This gradient creates a Push-Pull Force: it pushes the Query Q towards visual features while actively repelling it from the text tokens (e.g., the word "dog" in the prompt). This suppresses the unimodal language priors that cause hallucinations when visual evidence is missing. (A toy demo of this sign structure is shown below.)
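To make the Push-Pull concrete, here is a self-contained toy demo (random logits and an arbitrary choice of which positions count as image tokens, purely for illustration) showing the sign structure of ∇h with respect to the attention logits: positive on image positions, negative on text positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(10, requires_grad=True)  # toy attention logits for a single query
image_slice = slice(0, 6)                     # pretend positions 0-5 are image tokens

probs = F.softmax(logits, dim=-1)
h = probs[image_slice].sum()                  # barrier h: probability mass on image tokens
h.backward()

print(logits.grad[image_slice])  # all positive -> "push" toward image tokens
print(logits.grad[6:])           # all negative -> "pull" away from text tokens
```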
Phase 1: Prefill Intervention (The "Step-0" Problem)
On benchmarks like POPE (Yes/No QA), the model computes its entire answer logit ("Yes" or "No") immediately after processing the last token of the prompt (seq_len > 1). Standard auto-regressive steering is too late; the decision is already "locked in" by the prefill computation.
- Mechanism: We detect the prefill phase and intervene specifically on the Last Token of the Prompt. We calculate the Softmax distribution of this specific token against the image and inject θ* to suppress text-based biases before the model commits to a decision.
Phase 2: Decoding Intervention (Generative Drift)
Over long captions (CHAIR benchmark), attention "drifts" away from visual grounding. We intervene on every generated token (seq_len == 1) to maintain the Softmax probability mass on the image above τ, ensuring consistent grounding.
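For quick reference, the dispatch between the two phases reduces to a check on seq_len. Here is a condensed, self-contained sketch of that logic (phase_dispatch is a hypothetical helper for illustration; the config keys mirror STEERING_CONFIG in the full script in Section 4):

```python
def phase_dispatch(seq_len, config):
    """Pick the target slice and barrier parameters for the current forward call.

    seq_len > 1  -> Phase 1 (prefill): steer only the last prompt token (Step-0 decision).
    seq_len == 1 -> Phase 2 (decode):  steer the single token being generated.
    """
    if seq_len > 1:
        return slice(-1, None), config["prefill_tau"], config["prefill_alpha"]
    return slice(None), config["decode_tau"], config["decode_alpha"]

cfg = {"prefill_tau": 0.2, "prefill_alpha": 1.0, "decode_tau": 0.2, "decode_alpha": 0.5}
print(phase_dispatch(42, cfg))  # prefill: (slice(-1, None), 0.2, 1.0)
print(phase_dispatch(1, cfg))   # decode:  (slice(None, None), 0.2, 0.5)
```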
2. The Control Law: Taylor Expansion Linearization
To enforce the barrier, we treat the Transformer layer as a dynamical system and apply input-constrained Optimal Control.
We define the safe set as C = {x : h(x) ≥ τ}. To find the minimal intervention θ* that projects the state back into C, we linearize the highly non-linear Softmax manifold using a First-Order Taylor Expansion around the current state x_t:

h(x_t + θ) ≈ h(x_t) + ∇h(x_t)^T θ

We then solve the QP (Quadratic Program):

minimize ||θ||^2  subject to  h(x_t) + ∇h(x_t)^T θ ≥ τ

This yields the closed-form solution (active only when h(x_t) < τ):

θ* = ((τ - h(x_t)) / ||∇h(x_t)||^2) ∇h(x_t)

This vector θ* represents the direction of steepest ascent on the probability manifold: the most efficient way to shift attention mass from text to image.
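Below is a minimal sketch of that closed-form projection in isolation; the eps regularizer and the 4096-dimensional stand-in gradient are illustrative (they mirror the hidden size and the regularized denominator used in the hook further down), not the actual experiment code.

```python
import torch

def cbf_correction(h_val, grad, tau, eps=1e-6):
    """Closed-form solution of the linearized CBF-QP:
        min ||theta||^2   s.t.   h(x) + grad . theta >= tau
    Returns theta* = ((tau - h) / ||grad||^2) * grad when the barrier is violated, else 0."""
    slack = tau - h_val
    if slack <= 0:
        return torch.zeros_like(grad)   # already inside the safe set C, no correction needed
    return (slack / (grad.pow(2).sum() + eps)) * grad

grad = torch.randn(4096)                               # stand-in for grad_x h(x_t)
theta = cbf_correction(h_val=0.05, grad=grad, tau=0.2)
print(theta.norm())                                    # magnitude of the minimal correction
```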
4. Code Snippet (Implementation Details)
Here is exactly how I implemented the gradient flow to ensure mathematically correct steering on the un-normalized residual stream:
- Intercept `hidden_states` (Residual Stream): I catch `hidden_states` at the very start of `LlamaDecoderLayer.forward`.
- Manual Normalization: To calculate what the Attention layer would see, I manually apply `self.input_layernorm(hidden_states)`.
- Forward Projection: I project this normalized state to Q using the layer's weights (`self.self_attn.q_proj`) and apply RoPE.
- Backward Pass: I run `torch.autograd.grad` from the Softmax output back to the un-normalized `hidden_states`.
- Update: I add the resulting θ* in-place to the residual `hidden_states`.
"""
Experiment 6: Dual-Phase MC-LCBF with Softmax Probability Barrier
=================================================================
Methodology:
1. **Phase 1 (Prefill):** Intervene on the LAST token of the prompt (the question).
- Target: Step-0 "Yes/No" decision.
- Barrier: Softmax Probability Mass on Image Tokens.
- Mechanism: Push (Image) / Pull (Text) via Softmax Gradient.
2. **Phase 2 (Decode):** Intervene on every generated token.
- Target: Generative Drift (CHAIR).
- Barrier: Attention Energy (or Softmax) to maintain grounding.
- Mechanism: Keep attention on image to prevent hallucination.
Theoretical Justification:
- **Linearization:** We use a local first-order Taylor expansion of the complex
softmax/layernorm manifold to derive a closed-form linear correction.
- **Push-Pull:** The gradient of the softmax barrier naturally suppresses text
priors when boosting image attention, acting as a "negative constraint" on
hallucination.
- **Memory Consistency:** We intervene on the Residual Stream (before Norm)
to ensure the correction $\theta^*$ permanently updates the causal memory.
Implementation Details:
- Hooks into `LlamaDecoderLayer` forward pass (Architecture Fix).
- Uses `torch.enable_grad()` locally to compute $\nabla_x h(x)$.
- Differentiates through `LayerNorm` and `Softmax` for accurate gradients.
- Applies $\theta^*$ correction in-place to the raw residual hidden states.
"""
import os
import sys
import math
import types
import json
import torch
import torch.nn.functional as F
import argparse
import numpy as np
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, repeat_kv
# --- Setup Paths ---
PROJECT_DIR = os.getcwd()
sys.path.append(PROJECT_DIR)
POPE_DATA_DIR = os.path.join(PROJECT_DIR, 'data/pope/coco')
POPE_IMAGE_DIR = os.path.join(PROJECT_DIR, 'data/mscoco/val2014')
RESULTS_DIR = os.path.join(PROJECT_DIR, 'results_exp6_dual_phase')
os.makedirs(RESULTS_DIR, exist_ok=True)
# --- Configuration ---
MODEL_ID = "llava-hf/llava-1.5-7b-hf"
CACHE_DIR = os.path.join(PROJECT_DIR, 'model_cache')
STEERING_CONFIG = {
"is_active": True,
# --- Prefill Barrier (Softmax Probability) ---
"prefill_tau": 0.2,
"prefill_alpha": 1.0, # Lowered for Residual Stream stability
# --- Decode Barrier (Energy or Softmax) ---
"decode_tau": 0.2,
"decode_alpha": 0.5, # Lowered for Residual Stream stability
# --- Shared ---
"steer_layers": list(range(10, 28)),
"img_start": 0,
"img_end": 0,
}
# =========================================================
# The Dual-Phase Decoder Layer Hook
# =========================================================
def dual_phase_decoder_wrapper(original_forward, layer_idx):
def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs):
# NOTE: self here is the LlamaDecoderLayer, NOT LlamaAttention
bsz, seq_len, hidden_size = hidden_states.size()
# Check if we should steer this layer
if STEERING_CONFIG["is_active"] and layer_idx in STEERING_CONFIG["steer_layers"]:
# --- Logic: Identify Target Token(s) ---
target_slice = slice(None)
do_steer = False
if seq_len > 1: # Prefill
# Target only the last token (Step-0 Decision token)
target_slice = slice(-1, None)
tau = STEERING_CONFIG["prefill_tau"]
alpha = STEERING_CONFIG["prefill_alpha"]
do_steer = True
else: # Decode
# Target the single token being generated
target_slice = slice(None)
tau = STEERING_CONFIG["decode_tau"]
alpha = STEERING_CONFIG["decode_alpha"]
do_steer = True
# --- A* PIVOT: Softmax Barrier on Residual Stream ---
if do_steer:
with torch.enable_grad():
# 1. Isolate the Target Residual State x_t
# We detach to start a fresh graph for the gradient calculation
x_full_resid = hidden_states
x_t_resid = x_full_resid[:, target_slice, :].clone().detach().requires_grad_(True)
# 2. Normalize x_t (Crucial Fix: Simulate Layer's behavior)
# We must pass gradients through this norm to align strictly with math.
x_t_norm = self.input_layernorm(x_t_resid)
# 3. Re-construct Q from x_t_norm
num_heads = self.self_attn.config.num_attention_heads
num_kv_heads = getattr(self.self_attn.config, "num_key_value_heads", num_heads)
head_dim = hidden_size // num_heads
q_probe = self.self_attn.q_proj(x_t_norm).view(bsz, x_t_resid.shape[1], num_heads, head_dim).transpose(1, 2)
# Apply RoPE
if position_embeddings is not None:
cos, sin = position_embeddings
# Handle possible shapes: [B/1, S, H, D], [B/1, S, D], or [S, D]
                        # For generated tokens or slicing, we need to carefully select the right position embeddings
if cos.dim() == 4:
cos_slice = cos[:, target_slice, :, :]
sin_slice = sin[:, target_slice, :, :]
elif cos.dim() == 3:
cos_slice = cos[:, target_slice, :]
sin_slice = sin[:, target_slice, :]
elif cos.dim() == 2:
cos_slice = cos[target_slice, :]
sin_slice = sin[target_slice, :]
else:
cos_slice, sin_slice = cos, sin
q_probe, _ = apply_rotary_pos_emb(q_probe, q_probe, cos_slice, sin_slice)
# 4. Construct K (Key Cache + Current)
img_start, img_end = STEERING_CONFIG["img_start"], STEERING_CONFIG["img_end"]
all_keys = None
if seq_len > 1: # Prefill
# Fix: Must normalize x_full_resid before projecting Keys!
with torch.no_grad():
# Normalize the full sequence for Key generation
x_full_norm = self.input_layernorm(x_full_resid.detach())
k_full = self.self_attn.k_proj(x_full_norm).view(bsz, seq_len, num_kv_heads, head_dim).transpose(1, 2)
if position_embeddings is not None:
k_full, _ = apply_rotary_pos_emb(k_full, k_full, cos, sin)
all_keys = k_full
else: # Decode
# Use past_key_values from cache
# Note: In `LlamaDecoderLayer`, `past_key_value` might be passed as argument
                        # But `LlavaForConditionalGeneration` usually manages a global `past_key_values` cache object.
# The `past_key_values` arg in forward is usually the full tuple.
# We need to access the cache for THIS layer.
# In HF implementation, `past_key_value` (singular) is often passed for the specific layer
# or we access `past_key_values[layer_idx]`.
# Let's try to robustly find the cache.
# The signature has `past_key_value` (singular) which is legacy, and `past_key_values` (plural).
current_keys = None
# Check `past_key_value` (Layer-specific cache in some HF versions)
if past_key_value is not None:
current_keys = past_key_value[0] # [B, n_heads, seq, dim]
# Check `past_key_values` (Global cache tuple)
elif kwargs.get('past_key_values') is not None and len(kwargs.get('past_key_values')) > layer_idx:
current_keys = kwargs.get('past_key_values')[layer_idx][0]
if current_keys is not None:
all_keys = current_keys
# 5. Compute Softmax Barrier
if all_keys is not None:
if num_heads != num_kv_heads:
all_keys = repeat_kv(all_keys, num_heads // num_kv_heads)
# Q * K^T
attn_logits = torch.matmul(q_probe, all_keys.transpose(-1, -2)) / math.sqrt(head_dim)
# Softmax
attn_probs = F.softmax(attn_logits, dim=-1)
# Barrier Function h(x): Sum of probabilities on Image Tokens
valid_end = min(img_end, attn_probs.shape[-1])
valid_start = min(img_start, valid_end)
if valid_end > valid_start:
image_mass = attn_probs[:, :, :, valid_start:valid_end].sum(dim=-1)
h_val = image_mass.mean()
if h_val < tau:
# Gradient w.r.t x_t_resid (The Residual State)
grads = torch.autograd.grad(h_val, x_t_resid, retain_graph=False)[0]
grad_norm_sq = torch.sum(grads * grads)
# Scale Calculation
scale = (tau - h_val) / (grad_norm_sq + 1e-6)
# Clamp (Critical for Residual Stream Stability)
if scale > 0.5: scale = 0.5
theta = scale * grads
# Apply Correction to Residual Stream
hidden_states[:, target_slice, :] += alpha * theta.detach()
# Execute Original Forward (now with modified hidden_states)
return original_forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
position_embeddings=position_embeddings,
**kwargs
)
return forward
# =========================================================
# Generator Class
# =========================================================
class DualPhase_LCBF_Generator:
def __init__(self, model, processor):
self.model = model
self.processor = processor
self.image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
print(f"\nInitializing Dual-Phase MC-LCBF Controller (Residual Stream Wrapped)...")
print(f" Prefill Target Mass: {STEERING_CONFIG['prefill_tau']}")
print(f" Decode Target Mass: {STEERING_CONFIG['decode_tau']}")
print(f"Injecting hooks into Layers {min(STEERING_CONFIG['steer_layers'])}-{max(STEERING_CONFIG['steer_layers'])}...")
# FIX: Wrap DecoderLayer, not Attention
layers = model.language_model.layers if hasattr(model, "language_model") else model.model.layers
for i, layer in enumerate(layers):
# We wrap the layer.forward, not layer.self_attn.forward
layer.forward = types.MethodType(
dual_phase_decoder_wrapper(layer.forward, i),
layer,
)
def generate(self, image, prompt, steering=True):
STEERING_CONFIG["is_active"] = steering
inputs = self.processor(text=prompt, images=image, return_tensors="pt")
# Estimate image span (LLaVA-1.5 specific)
STEERING_CONFIG["img_start"] = 1
STEERING_CONFIG["img_end"] = 577
input_ids = inputs.input_ids.to(self.model.device)
pixel_values = inputs.pixel_values.to(self.model.device, dtype=torch.float16)
with torch.no_grad():
output_ids = self.model.generate(
input_ids,
pixel_values=pixel_values,
max_new_tokens=5,
do_sample=False,
use_cache=True
)
return self.processor.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
# =========================================================
# Main Execution
# =========================================================
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--type", type=str, default="random", choices=['random', 'popular', 'adversarial'])
parser.add_argument("--num_images", type=int, default=None, help="Number of images to process")
args = parser.parse_args()
# Init
print(f"Loading {MODEL_ID}...")
model = LlavaForConditionalGeneration.from_pretrained(
MODEL_ID,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
attn_implementation="eager",
).to("cuda")
processor = AutoProcessor.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
gen = DualPhase_LCBF_Generator(model, processor)
# Load POPE Data
question_file = os.path.join(POPE_DATA_DIR, f"coco_pope_{args.type}.json")
out_file = os.path.join(RESULTS_DIR, f"exp6_pope_{args.type}_dual_phase.jsonl")
questions = []
with open(question_file, 'r') as f:
for line in f:
questions.append(json.loads(line))
if args.num_images is not None:
questions = questions[:args.num_images]
print(f"Evaluating {len(questions)} samples on {args.type}...")
results = []
for item in tqdm(questions):
try:
image_path = os.path.join(POPE_IMAGE_DIR, item['image'])
if not os.path.exists(image_path): continue
image = Image.open(image_path).convert("RGB")
prompt = f"USER: <image>\n{item['text']}\nASSISTANT:"
# Baseline (No Steering)
ans_base = gen.generate(image, prompt, steering=False).lower().strip()
pred_base = "yes" if "yes" in ans_base else ("no" if "no" in ans_base else "no")
# Steered (Dual Phase)
ans_steer = gen.generate(image, prompt, steering=True).lower().strip()
pred_steer = "yes" if "yes" in ans_steer else ("no" if "no" in ans_steer else "no")
res = {
'question': item['text'],
'label': item['label'],
'baseline_pred': pred_base,
'steered_pred': pred_steer
}
results.append(res)
with open(out_file, 'a') as f:
f.write(json.dumps(res) + "\n")
except Exception as e:
print(f"Error: {e}")
# Metrics
acc_base = sum([1 for x in results if x['label'] == x['baseline_pred']]) / len(results)
acc_steer = sum([1 for x in results if x['label'] == x['steered_pred']]) / len(results)
print(f"\nFinal Results Exp 6 ({args.type}):")
print(f"Baseline Accuracy: {acc_base:.4f}")
print(f"Steered Accuracy: {acc_steer:.4f}")
5. Why is POPE Not Improving?
This is my main concern. Even with alpha=1.0 (or 4.0) and tau=0.5 (forcing 50% attention on the image), the final generated "Yes/No" tokens on POPE are identical to the baseline.
My Questions for the Community:
- RMSNorm Gradient Vanishing: Does backpropagating through `LlamaRMSNorm` dampen the gradient so much that the update to the residual stream is negligible? Or does the non-linearity of RMSNorm distort the assumption that `grad(norm(x))` points in a useful direction for `x`?
- KV Cache Mismatch: Since `LlamaDecoderLayer` modifies the KV cache in-place during the forward pass, does my "look-ahead" calculation of K (derived from a detached `x_full` which I manually project) create a mismatch with the actual `past_key_values` used by the model?
- Residual Dampening: Is adding a vector to the residual stream (dim 4096) ineffective because the subsequent MLP block or the next layer's Norm immediately crushes this perturbation? (See the diagnostic sketch below.)
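For questions 1 and 3, here is the small diagnostic sketch referenced above: a hypothetical helper (log_correction_stats is not part of the script) that I would call from inside the hook, right after computing grads and theta, to check whether the correction is vanishing relative to the residual state it perturbs.

```python
def log_correction_stats(layer_idx, grads, theta, x_t_resid):
    """Hypothetical diagnostic: is theta* negligible relative to the residual stream?"""
    g = grads.float().norm().item()        # gradient magnitude after backprop through RMSNorm
    t = theta.float().norm().item()        # magnitude of theta* (before the alpha scaling)
    x = x_t_resid.float().norm().item()    # magnitude of the residual state being steered
    print(f"[layer {layer_idx:02d}] ||grad||={g:.3e}  ||theta||={t:.3e}  "
          f"theta/x ratio={t / (x + 1e-9):.3e}")
```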
Any insights on steering the Residual Stream vs. Post-Norm states in Llama-2 would be incredibly helpful. Are my gradients flowing correctly through that manual Norm step?
Thanks!