[LLaVA-1.5] Very low hallucination rate & weak attention correlation in "Attention Gap" experiment – Is my implementation of output_attentions correct?

Context & Objective

I am running an experiment to validate the “Attention Gap” hypothesis for Hallucination Reduction.

  • Hypothesis: Hallucinated tokens (objects generated but not in the image) should have statistically lower Attention Mass on the visual tokens compared to “Hit” tokens (correct objects).

  • Goal: Establish a baseline point-biserial correlation between Visual Attention Mass and Hallucination Status (0/1) before applying a steering intervention.

Experimental Setup

  • Model: llava-hf/llava-1.5-7b-hf (loaded in float16).

  • Attention Implementation: Forced attn_implementation="eager" to support output_attentions=True.

  • Dataset: MSCOCO 2014 Validation Set (Standard split for CHAIR metric).

  • Metric: CHAIR (Caption Hallucination Assessment with Image Relevance), adapted to map individual tokens to “Hit” or “Hallucination” spans.

The Implementation Logic

I am manually decoding token-by-token. At each step t:

  1. Visual Slice: I locate the <image> tokens in the input. For LLaVA-1.5, I confirm exactly 576 visual tokens are present.

  2. Attention Mass Calculation:

    • I extract outputs.attentions (tuple of 32 layers).

    • I select Layers 10–25 (Mid-layers, where visual-semantic alignment typically happens).

    • I sum the attention weights over the 576 visual tokens for the last generated token q_t.

    • I mean-pool across the selected layers and heads (the exact quantity is written out just after this list).

  3. Alignment: I use the CHAIR metric to identify which words in the caption are hallucinations. I then map the generated tokens to these words using character-span alignment.
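
To make step 2 explicit, the per-token quantity I intend to compute is

$$ m_t \;=\; \frac{1}{|\mathcal{L}|\,H} \sum_{\ell \in \mathcal{L}} \sum_{h=1}^{H} \sum_{j=\text{img\_start}}^{\text{img\_end}-1} A^{(\ell,h)}_{t,\,j}, \qquad \mathcal{L} = \{10, \dots, 25\}, $$

where $A^{(\ell,h)}_{t,j}$ is the attention weight from the current query token to key position $j$ in layer $\ell$, head $h$, $H$ is the number of heads, and $[\text{img\_start}, \text{img\_end})$ is the index range of the 576 visual tokens.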

The Problem / Anomalies

After analyzing ~7,200 object tokens, my results are statistically suspicious:

  1. Extremely Low Hallucination Rate (0.8%):

    • Result: Out of 7,218 object tokens, only 62 were classified as hallucinations.

    • Expectation: Standard LLaVA-1.5 usually has a CHAIR$_i$ score between 5%–15% on COCO. A 0.8% rate implies the model is performing impossibly well, or my evaluation is missing nearly all hallucinations.

  2. Weak Significance:

    • Result: Mean Mass (Hit): 0.0912 vs Mean Mass (Hall): 0.0819.

    • Correlation: r = -0.0187 (p=0.11).

    • While the “gap” goes in the expected direction (Hits have higher mass), the difference is not statistically significant, likely due to the extreme class imbalance (only 62 hallucination samples).

My Code Implementation

Below is the core logic for extracting attention and aligning tokens. I suspect I might be calculating the mass incorrectly or missing a subtlety in how eager attention handles the KV cache for visual tokens.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
import os
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from tqdm import tqdm
from PIL import Image
import sys
import re

# --- Setup Paths ---
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(PROJECT_DIR)
from data_setup import load_llava_model
from utils.chair import CHAIR


class Experiment1_Fixed:
    def __init__(self, model, processor, coco_img_dir, output_dir="results_exp1_fixed"):
        self.model = model
        self.processor = processor
        self.coco_img_dir = coco_img_dir
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

        if hasattr(self.model.config, "image_token_index"):
            self.image_token_id = self.model.config.image_token_index
        else:
            self.image_token_id = 32000
        self._printed_debug = False

    def compute_attention_mass(self, attentions, start_idx, end_idx):
        """
        Sums attention mass over the visual tokens (start_idx to end_idx).
        Averages across layers 10-25 (deep visual-semantic layers).
        """
        # LLaVA-7B visual layers (middle of the network)
        target_layers = range(10, 26)

        total_mass = 0.0
        valid_layers = 0

        for i in target_layers:
            if i >= len(attentions):
                break

            # Shape: (batch, heads, 1, seq_len) -> (heads, seq_len)
            attn_matrix = attentions[i][0, :, -1, :]

            # Safety check
            if end_idx > attn_matrix.shape[-1]:
                continue

            # We want total attention paid to the image
            layer_mass = attn_matrix[:, start_idx:end_idx].sum(dim=-1).mean().item()
            total_mass += layer_mass
            valid_layers += 1

        if valid_layers == 0:
            return 0.0
        return total_mass / valid_layers

    @torch.inference_mode()
    def run_inference(self, image_files, num_images=100):
        traces = []
        print(f"Running inference on {num_images} images...")

        for img_filename in tqdm(image_files[:num_images]):
            image_path = os.path.join(self.coco_img_dir, img_filename)
            try:
                image = Image.open(image_path).convert("RGB")
            except:
                continue

            # Parse Image ID
            try:
                image_id = int(img_filename.split('_')[-1].split('.')[0])
            except:
                continue

            prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
            # Processor automatically expands <image> into 576 tokens in input_ids
            inputs = self.processor(text=prompt, images=image, return_tensors="pt")

            input_ids = inputs.input_ids.to(self.model.device)
            pixel_values = inputs.pixel_values.to(self.model.device, dtype=torch.float16)

            # --- PREFILL ---
            prefill_out = self.model(
                input_ids=input_ids,
                pixel_values=pixel_values,
                use_cache=True,
                return_dict=True
            )
            past_key_values = prefill_out.past_key_values

            # Find all indices where input is <image> (32000)
            image_indices = (input_ids[0] == self.image_token_id).nonzero(as_tuple=True)[0]
            if len(image_indices) == 0:
                continue
            # The start is the first occurrence
            start_idx = image_indices[0].item()
            # The end is the last occurrence + 1
            end_idx = image_indices[-1].item() + 1
            # SANITY CHECK: LLaVA-1.5 should have exactly 576 visual tokens
            visual_count = end_idx - start_idx
            if visual_count != 576:
                print(f"\n[WARNING] Found {visual_count} visual tokens instead of 576. Skipping.")
                continue

            if not self._printed_debug:
                print(f"\n--- DEBUG CONFIRMED ---")
                print(f"Image Tokens Start: {start_idx}")
                print(f"Image Tokens End:   {end_idx}")
                print(f"Total Visual Mass Range: {visual_count} tokens")
                print(f"-----------------------\n")
                self._printed_debug = True

            # --- DECODING LOOP ---
            next_token = torch.argmax(prefill_out.logits[:, -1, :], dim=-1).unsqueeze(1)
            generated_ids = []
            token_trace = []
            # Using 'full_prefix' logic for alignment
            current_prefix = self.processor.decode(input_ids[0], skip_special_tokens=True)

            for _ in range(80):
                outputs = self.model(
                    input_ids=next_token,
                    past_key_values=past_key_values,
                    use_cache=True,
                    output_attentions=True
                )

                # Calculate Mass
                mass = self.compute_attention_mass(outputs.attentions, start_idx, end_idx)
                token_id = next_token.item()
                generated_ids.append(token_id)
                # We track the text growing step by step
                token_text = self.processor.decode([token_id], skip_special_tokens=True)
                # Update prefix for next step alignment
                trace_entry = {
                    "token": token_text,
                    "mass": mass,
                    "span_start": len(current_prefix),
                    "span_end": len(current_prefix) + len(token_text)
                }
                current_prefix += token_text
                token_trace.append(trace_entry)

                past_key_values = outputs.past_key_values
                next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1).unsqueeze(1)

                if next_token.item() == self.processor.tokenizer.eos_token_id:
                    break

            full_caption = self.processor.decode(generated_ids, skip_special_tokens=True)

            traces.append({
                "image_id": image_id,
                "caption": full_caption,
                "trace": token_trace
            })

        return traces

    def align_and_evaluate(self, traces, chair_evaluator):
        print("\nRunning CHAIR evaluation...")
        # Save temp file for CHAIR
        chair_input = [{"image_id": t["image_id"], "caption": t["caption"]} for t in traces]
        temp_file = os.path.join(self.output_dir, "temp_caps.json")
        with open(temp_file, "w") as f:
            json.dump(chair_input, f)

        # Run CHAIR
        chair_results = chair_evaluator.compute_chair(temp_file, "image_id", "caption")
        print("CHAIR metrics:", chair_results.get("overall_metrics", {}))

        hits_mass = []
        halls_mass = []

        for i, res in enumerate(chair_results["sentences"]):
            caption = traces[i]["caption"]
            trace = traces[i]["trace"]

            # Get Raw Words
            hall_words = set([h[0].lower() for h in res["mscoco_hallucinated_words"]])
            object_words = set([w.lower() for w in res["words"]])  # All objects detected

            # Find Spans in Caption
            word_spans = []
            for word in object_words:
                # Use simple find for robustness or regex
                start = 0
                while True:
                    idx = caption.lower().find(word, start)
                    if idx == -1:
                        break
                    word_spans.append({
                        "start": idx,
                        "end": idx + len(word),
                        "type": "HALL" if word in hall_words else "HIT"
                    })
                    start = idx + len(word)

            # Map tokens to spans
            for token in trace:
                if not token["token"].strip():
                    continue
                t_start = token["span_start"]
                t_end = token["span_end"]
                t_mass = token["mass"]

                # Check overlaps
                for span in word_spans:
                    # Logic: Do the token and the word overlap significantly?
                    overlap_start = max(t_start, span["start"])
                    overlap_end = min(t_end, span["end"])
                    if overlap_start < overlap_end:
                        if span["type"] == "HALL":
                            halls_mass.append(t_mass)
                        else:
                            hits_mass.append(t_mass)
                        # Break ensures we don't double count one token for multiple overlaps
                        # (though rare with non-overlapping objects)
                        break

        return hits_mass, halls_mass

    def analyze(self, hits, halls):
        hits = np.array(hits)
        halls = np.array(halls)

        print("\n--- FINAL CORRECTED RESULTS ---")
        print(f"Total tokens analyzed: {len(hits) + len(halls)}")
        print(f"Hits: {len(hits)}")
        print(f"Hallucinations: {len(halls)}")

        if len(halls) < 5 or len(hits) < 5:
            print("Insufficient data for statistics.")
            return

        print(f"Mean Mass (Hit): {hits.mean():.4f}")
        print(f"Mean Mass (Hall): {halls.mean():.4f}")

        # Point Biserial
        y = np.concatenate([np.zeros(len(hits)), np.ones(len(halls))])
        x = np.concatenate([hits, halls])
        corr, p = stats.pointbiserialr(y, x)
        print(f"Correlation: {corr:.4f}")
        print(f"P-value: {p:.4e}")

        # Plot
        plt.figure(figsize=(10, 6))
        sns.kdeplot(hits, label=f"Hits (Mean: {hits.mean():.2f})", fill=True, color='blue')
        sns.kdeplot(halls, label=f"Hallucinations (Mean: {halls.mean():.2f})", fill=True, color='red')
        plt.title(f"Experiment 1: Attention Gap (Corr: {corr:.3f})")
        plt.xlabel("Attention Mass (Sum over 576 Visual Tokens)")
        plt.legend()
        plt.savefig(os.path.join(self.output_dir, "exp1_final_plot.png"))
        print(f"Plot saved to {self.output_dir}")


if __name__ == "__main__":
    # Load Model
    model_id = "llava-hf/llava-1.5-7b-hf"
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        attn_implementation="eager",
    ).to("cuda")
    processor = AutoProcessor.from_pretrained(model_id)
    # Check paths
    anno_dir = os.path.join(os.getcwd(), 'data/mscoco/annotations')
    img_dir = os.path.join(os.getcwd(), 'data/mscoco/val2014')
    if not os.path.exists(anno_dir) or not os.path.exists(img_dir):
        print("Error: Data not found. Please run data_setup.py")
        sys.exit(1)

    evaluator = CHAIR(anno_dir)
    exp = Experiment1_Fixed(model, processor, img_dir)
    # Get images
    all_images = sorted([f for f in os.listdir(img_dir) if f.endswith(".jpg")])
    # Run
    traces = exp.run_inference(all_images, num_images=100)
    hits, halls = exp.align_and_evaluate(traces, evaluator)
    exp.analyze(hits, halls)

utils/chair.py

'''
Copied from: https://github.com/LisaAnne/Hallucination/blob/master/utils/chair.py

Modified by: Maxlinn

1. adapt calculation of CHAIR-i and CHAIR-s for Python 3, supports both json and jsonl file input.
2. integrate synonyms.txt to make the script standalone.
3. remove machine-translation based metrics BLEU-n, CIDEr, ROUGE.
4. add new metric Recall, which represents the node-word (i.e. object lemma) coverage overall.
5. add pickle cache mechanism to make repeated evaluations fast.
'''

import os
import sys
import nltk
import json
from pattern.en import singularize
import argparse
import tqdm
import pickle
from collections import defaultdict

# copied from: https://github.com/LisaAnne/Hallucination/blob/master/data/synonyms.txt

synonyms_txt = '''

person, girl, boy, man, woman, kid, child, chef, baker, people, adult, rider, children, baby, worker, passenger, sister, biker, policeman, cop, officer, lady, cowboy, bride, groom, male, female, guy, traveler, mother, father, gentleman, pitcher, player, skier, snowboarder, skater, skateboarder, person, woman, guy, foreigner, child, gentleman, caller, offender, coworker, trespasser, patient, politician, soldier, grandchild, serviceman, walker, drinker, doctor, bicyclist, thief, buyer, teenager, student, camper, driver, solider, hunter, shopper, villager

bicycle, bike, bicycle, bike, unicycle, minibike, trike

car, automobile, van, minivan, sedan, suv, hatchback, cab, jeep, coupe, taxicab, limo, taxi

motorcycle, scooter,  motor bike, motor cycle, motorbike, scooter, moped

airplane, jetliner, plane, air plane, monoplane, aircraft, jet, jetliner, airbus, biplane, seaplane

bus, minibus, trolley

train, locomotive, tramway, caboose

truck, pickup, lorry, hauler, firetruck

boat, ship, liner, sailboat, motorboat, dinghy, powerboat, speedboat, canoe, skiff, yacht, kayak, catamaran, pontoon, houseboat, vessel, rowboat, trawler, ferryboat, watercraft, tugboat, schooner, barge, ferry, sailboard, paddleboat, lifeboat, freighter, steamboat, riverboat, battleship, steamship

traffic light, street light, traffic signal, stop light, streetlight, stoplight

fire hydrant, hydrant

stop sign

parking meter

bench, pew

bird, ostrich, owl, seagull, goose, duck, parakeet, falcon, robin, pelican, waterfowl, heron, hummingbird, mallard, finch, pigeon, sparrow, seabird, osprey, blackbird, fowl, shorebird, woodpecker, egret, chickadee, quail, bluebird, kingfisher, buzzard, willet, gull, swan, bluejay, flamingo, cormorant, parrot, loon, gosling, waterbird, pheasant, rooster, sandpiper, crow, raven, turkey, oriole, cowbird, warbler, magpie, peacock, cockatiel, lorikeet, puffin, vulture, condor, macaw, peafowl, cockatoo, songbird

cat, kitten, feline, tabby

dog, puppy, beagle, pup, chihuahua, schnauzer, dachshund, rottweiler, canine, pitbull, collie, pug, terrier, poodle, labrador, doggie, doberman, mutt, doggy, spaniel, bulldog, sheepdog, weimaraner, corgi, cocker, greyhound, retriever, brindle, hound, whippet, husky

horse, colt, pony, racehorse, stallion, equine, mare, foal, palomino, mustang, clydesdale, bronc, bronco

sheep, lamb, ram, lamb, goat, ewe

cow, cattle, oxen, ox, calf, cattle, holstein, heifer, buffalo, bull, zebu, bison 

elephant

bear, panda

zebra

giraffe

backpack, knapsack

umbrella

handbag, wallet, purse, briefcase

tie, bow, bow tie

suitcase, suit case, luggage

frisbee

skis, ski

snowboard

sports ball, ball

kite

baseball bat

baseball glove

skateboard

surfboard, longboard, skimboard, shortboard, wakeboard

tennis racket, racket

bottle

wine glass

cup

fork

knife, pocketknife, knive

spoon

bowl, container

banana

apple

sandwich, burger, sub, cheeseburger, hamburger

orange

broccoli

carrot

hot dog

pizza

donut, doughnut, bagel

cake,  cheesecake, cupcake, shortcake, coffeecake, pancake

chair, seat, stool

couch, sofa, recliner, futon, loveseat, settee, chesterfield 

potted plant, houseplant

bed

dining table, table, desk

toilet, urinal, commode, toilet, lavatory, potty

tv, monitor, televison, television

laptop, computer, notebook, netbook, lenovo, macbook, laptop computer

mouse

remote

keyboard

cell phone, mobile phone, phone, cellphone, telephone, phon, smartphone, iPhone

microwave

oven, stovetop, stove, stove top oven

toaster

sink

refrigerator, fridge, fridge, freezer

book

clock

vase

scissors

teddy bear, teddybear

hair drier, hairdryer

toothbrush

'''





def combine_coco_captions(annotation_path):

    if not os.path.exists('%s/captions_%s2014.json' % (annotation_path, 'val')):
        raise Exception("Please download MSCOCO caption annotations for val set")
    if not os.path.exists('%s/captions_%s2014.json' % (annotation_path, 'train')):
        raise Exception("Please download MSCOCO caption annotations for train set")

    val_caps = json.load(open('%s/captions_%s2014.json' % (annotation_path, 'val')))
    train_caps = json.load(open('%s/captions_%s2014.json' % (annotation_path, 'train')))
    all_caps = {'info': train_caps['info'],
                'licenses': train_caps['licenses'],
                'images': val_caps['images'] + train_caps['images'],
                'annotations': val_caps['annotations'] + train_caps['annotations']}

    return all_caps


def combine_coco_instances(annotation_path):

    if not os.path.exists('%s/instances_%s2014.json' % (annotation_path, 'val')):
        raise Exception("Please download MSCOCO instance annotations for val set")
    if not os.path.exists('%s/instances_%s2014.json' % (annotation_path, 'train')):
        raise Exception("Please download MSCOCO instance annotations for train set")

    val_instances = json.load(open('%s/instances_%s2014.json' % (annotation_path, 'val')))
    train_instances = json.load(open('%s/instances_%s2014.json' % (annotation_path, 'train')))
    all_instances = {'info': train_instances['info'],
                     'licenses': train_instances['licenses'],
                     'type': train_instances['licenses'],
                     'categories': train_instances['categories'],
                     'images': train_instances['images'] + val_instances['images'],
                     'annotations': val_instances['annotations'] + train_instances['annotations']}

    return all_instances


class CHAIR(object):

    def __init__(self, coco_path):

        self.imid_to_objects = defaultdict(list)  # later becomes a dict of sets

        self.coco_path = coco_path

        # read in synonyms
        synonyms = synonyms_txt.splitlines()
        synonyms = [s.strip().split(', ') for s in synonyms]
        self.mscoco_objects = []  # mscoco objects and *all* synonyms
        self.inverse_synonym_dict = {}
        for synonym in synonyms:
            self.mscoco_objects.extend(synonym)
            for s in synonym:
                self.inverse_synonym_dict[s] = synonym[0]

        # Some hard coded rules for implementing CHAIR metrics on MSCOCO

        # common 'double words' in MSCOCO that should be treated as a single word
        coco_double_words = ['motor bike', 'motor cycle', 'air plane', 'traffic light', 'street light',
                             'traffic signal', 'stop light', 'fire hydrant', 'stop sign', 'parking meter',
                             'suit case', 'sports ball', 'baseball bat', 'baseball glove', 'tennis racket',
                             'wine glass', 'hot dog', 'cell phone', 'mobile phone', 'teddy bear',
                             'hair drier', 'potted plant', 'bow tie', 'laptop computer', 'stove top oven',
                             'hot dog', 'teddy bear', 'home plate', 'train track']

        # Hard code some rules for special cases in MSCOCO
        # qualifiers like 'baby' or 'adult' animal will lead to a false fire for the MSCOCO object 'person'. 'baby bird' --> 'bird'.
        animal_words = ['bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
                        'giraffe', 'animal', 'cub']
        # qualifiers like 'passenger' vehicle will lead to a false fire for the MSCOCO object 'person'. 'passenger jet' --> 'jet'.
        vehicle_words = ['jet', 'train']

        # double_word_dict will map double words to the word they should be treated as in our analysis
        self.double_word_dict = {}
        for double_word in coco_double_words:
            self.double_word_dict[double_word] = double_word
        for animal_word in animal_words:
            self.double_word_dict['baby %s' % animal_word] = animal_word
            self.double_word_dict['adult %s' % animal_word] = animal_word
        for vehicle_word in vehicle_words:
            self.double_word_dict['passenger %s' % vehicle_word] = vehicle_word
        self.double_word_dict['bow tie'] = 'tie'
        self.double_word_dict['toilet seat'] = 'toilet'
        self.double_word_dict['wine glas'] = 'wine glass'

        self.get_annotations()

    def _load_generated_captions_into_evaluator(self, cap_file, image_id_key, caption_key):
        '''
        Meant to save time so imid_to_objects does not always need to be recomputed.
        '''
        # Read in captions
        self.caps, self.eval_imids = load_generated_captions(cap_file, image_id_key, caption_key)
        assert len(self.caps) == len(self.eval_imids)

    def caption_to_words(self, caption):
        '''
        Input: caption
        Output: MSCOCO words in the caption
        '''
        # standard preprocessing
        words = nltk.word_tokenize(caption.lower())
        words = [singularize(w) for w in words]

        # replace double words
        i = 0
        double_words = []
        idxs = []
        while i < len(words):
            idxs.append(i)
            double_word = ' '.join(words[i:i + 2])
            if double_word in self.double_word_dict:
                double_words.append(self.double_word_dict[double_word])
                i += 2
            else:
                double_words.append(words[i])
                i += 1
        words = double_words

        # toilet seat is not chair (sentences like "the seat of the toilet" will fire for "chair" if we do not include this line)
        if ('toilet' in words) & ('seat' in words):
            words = [word for word in words if word != 'seat']

        # get synonyms for all words in the caption
        idxs = [idxs[idx] for idx, word in enumerate(words)
                if word in set(self.mscoco_objects)]
        words = [word for word in words if word in set(self.mscoco_objects)]
        node_words = []
        for word in words:
            node_words.append(self.inverse_synonym_dict[word])
        # return all the MSCOCO objects in the caption
        return words, node_words, idxs, double_words

    def get_annotations_from_segments(self):
        '''
        Add objects taken from MSCOCO segmentation masks
        '''
        coco_segments = combine_coco_instances(self.coco_path)
        segment_annotations = coco_segments['annotations']

        # make dict linking object name to ids
        id_to_name = {}  # dict with id to synsets
        for cat in coco_segments['categories']:
            id_to_name[cat['id']] = cat['name']

        for i, annotation in enumerate(segment_annotations):
            sys.stdout.write("\rGetting annotations for %d/%d segmentation masks"
                             % (i, len(segment_annotations)))
            imid = annotation['image_id']
            node_word = self.inverse_synonym_dict[id_to_name[annotation['category_id']]]
            self.imid_to_objects[imid].append(node_word)
        print("\n")

    def get_annotations_from_captions(self):
        '''
        Add objects taken from MSCOCO ground truth captions
        '''
        coco_caps = combine_coco_captions(self.coco_path)
        caption_annotations = coco_caps['annotations']

        for i, annotation in enumerate(caption_annotations):
            sys.stdout.write('\rGetting annotations for %d/%d ground truth captions'
                             % (i, len(coco_caps['annotations'])))
            imid = annotation['image_id']
            _, node_words, _, _ = self.caption_to_words(annotation['caption'])
            # note here is update, so call get_annotations_from_segments first
            self.imid_to_objects[imid].extend(node_words)
        print("\n")

    def get_annotations(self):
        '''
        Get annotations from both segmentation and captions. Need both annotation types for CHAIR metric.
        '''
        self.get_annotations_from_segments()
        self.get_annotations_from_captions()
        # deduplicate
        for imid in self.imid_to_objects:
            self.imid_to_objects[imid] = set(self.imid_to_objects[imid])

    def compute_chair(self, cap_file, image_id_key, caption_key):
        '''
        Given ground truth objects and generated captions, determine which sentences have hallucinated words.
        '''
        self._load_generated_captions_into_evaluator(cap_file, image_id_key, caption_key)
        imid_to_objects = self.imid_to_objects
        caps = self.caps
        eval_imids = self.eval_imids

        num_caps = 0.
        num_hallucinated_caps = 0.
        hallucinated_word_count = 0.
        coco_word_count = 0.
        # :add:
        num_recall_gt_objects = 0.
        num_gt_objects = 0.

        output = {'sentences': []}

        for i in tqdm.trange(len(caps)):
            cap: str = caps[i]
            imid: int = eval_imids[i]

            # get all words in the caption, as well as corresponding node word
            words, node_words, idxs, raw_words = self.caption_to_words(cap)
            gt_objects = imid_to_objects[imid]
            cap_dict = {'image_id': imid,
                        'caption': cap,
                        'mscoco_hallucinated_words': [],
                        'mscoco_gt_words': list(gt_objects),
                        'mscoco_generated_words': list(node_words),
                        'hallucination_idxs': [],
                        'words': raw_words
                        }

            # :add:
            cap_dict['metrics'] = {'CHAIRs': 0,
                                   'CHAIRi': 0,
                                   'Recall': 0}

            # count hallucinated words
            coco_word_count += len(node_words)
            hallucinated = False
            # add
            recall_gt_objects = set()
            for word, node_word, idx in zip(words, node_words, idxs):
                if node_word not in gt_objects:
                    hallucinated_word_count += 1
                    cap_dict['mscoco_hallucinated_words'].append((word, node_word))
                    cap_dict['hallucination_idxs'].append(idx)
                    hallucinated = True
                else:
                    recall_gt_objects.add(node_word)

            # count hallucinated caps
            num_caps += 1
            if hallucinated:
                num_hallucinated_caps += 1

            # add
            num_gt_objects += len(gt_objects)
            num_recall_gt_objects += len(recall_gt_objects)

            cap_dict['metrics']['CHAIRs'] = int(hallucinated)
            cap_dict['metrics']['CHAIRi'] = 0.
            cap_dict['metrics']['Recall'] = 0.

            if len(words) > 0:
                cap_dict['metrics']['CHAIRi'] = len(cap_dict['mscoco_hallucinated_words']) / float(len(words))
            # add
            if len(gt_objects) > 0:
                cap_dict['metrics']['Recall'] = len(recall_gt_objects) / len(gt_objects)

            output['sentences'].append(cap_dict)

        chair_s = (num_hallucinated_caps / num_caps)
        chair_i = (hallucinated_word_count / coco_word_count)
        # add
        recall = num_recall_gt_objects / num_gt_objects

        output['overall_metrics'] = {'CHAIRs': chair_s,
                                     'CHAIRi': chair_i,
                                     'Recall': recall}

        return output


def load_generated_captions(cap_file, image_id_key: str, caption_key: str):
    # Read in captions; it should be a list of dicts
    ext = os.path.splitext(cap_file)[-1]
    if ext == '.json':
        caps = json.load(open(cap_file))
    elif ext == '.jsonl':
        caps = [json.loads(s) for s in open(cap_file)]
    else:
        raise ValueError(f'Unsupported extension {ext} for cap_file: {cap_file}')

    # list of int
    imids = [obj[image_id_key] for obj in caps]
    # list of str
    caps = [obj[caption_key] for obj in caps]
    return caps, imids


def save_hallucinated_words(cap_file, cap_dict):
    with open(cap_file, 'w') as f:
        json.dump(cap_dict, f, indent=2, ensure_ascii=False)


def print_metrics(hallucination_cap_dict, quiet=False):
    sentence_metrics = hallucination_cap_dict['overall_metrics']
    for k, v in sentence_metrics.items():
        k_str = str(k).ljust(10)
        v_str = f'{v * 100:.01f}'
        print(k_str, v_str, sep=': ')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--cap_file", type=str, default='',
                        help="path to a json or jsonl file storing image ids and their captions as a list of dicts.")
    parser.add_argument("--image_id_key", type=str, default="image_id",
                        help="in each dict of cap_file, which key stores the coco image id.")
    parser.add_argument("--caption_key", type=str, default="caption",
                        help="in each dict of cap_file, which key stores the caption of the image.")
    parser.add_argument("--cache", type=str, default="chair.pkl",
                        help="pre-initialized CHAIR evaluator object, for fast loading.")
    parser.add_argument("--coco_path", type=str, default='coco_annotations',
                        help="only used when regenerating the CHAIR evaluator object; ignored if a cached evaluator is used.")
    parser.add_argument("--save_path", type=str, default="",
                        help="save CHAIR evaluation results to json, useful for debugging the caption model.")
    args = parser.parse_args()

    if args.cache and os.path.exists(args.cache):
        evaluator = pickle.load(open(args.cache, 'rb'))
        print(f"loaded evaluator from cache: {args.cache}")
    else:
        print("cache not set or does not exist yet, building from scratch...")
        evaluator = CHAIR(args.coco_path)
        pickle.dump(evaluator, open(args.cache, 'wb'))
        print(f"cached evaluator to: {args.cache}")

    cap_dict = evaluator.compute_chair(args.cap_file, args.image_id_key, args.caption_key)
    print_metrics(cap_dict)
    if args.save_path:
        save_hallucinated_words(args.save_path, cap_dict)

My Questions

  1. Logic Check: Is the align_and_evaluate logic flawed in a way that ignores partial matches or synonyms, leading to the low hallucination count?

  2. Attention Validity: Is calculating the sum() of attention weights over the visual tokens (specifically layers 10-25) the standard proxy for “Visual Grounding” in LLaVA?

    Any insights on why the hallucination capture rate is so low, or whether the attention extraction logic misses the mark, would be greatly appreciated.


Seems like there are a few bugs…?


1) Why your hallucination rate is (almost certainly) artificially low

There are two independent bugs in your align_and_evaluate logic that can easily collapse your hallucination rate to ~0–1% even if CHAIR itself is normal.

A. Your token spans are offset by the prompt, but your word spans are caption-only

You build token spans using:

current_prefix = processor.decode(input_ids[0], skip_special_tokens=True)
span_start = len(current_prefix)

So span_start/span_end are measured in the string:

decoded(prompt + placeholders removed) + decoded(generated tokens)

But you search word spans inside:

caption = traces[i]["caption"]   # generated-only
idx = caption.lower().find(word, start)

So your word spans are measured in:

decoded(generated tokens only)

That mismatch means overlaps are not “meaningful overlaps” unless something else accidentally cancels the offset. Even if you still get some matches, the mapping becomes extremely unstable and biased.

Fix: track spans relative to the assistant caption only.

  • simplest: set current_prefix = "" before the decoding loop
  • or store an offset = len(decoded_prompt) and subtract it from all token spans before overlap tests (see the sketch right after this list).
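
A minimal sketch of the offset option, assuming your existing token_trace entries (the helper name rebase_trace_to_caption is mine, not something in your script):

def rebase_trace_to_caption(token_trace, prompt_text):
    """Shift spans from decoded(prompt)+generation coordinates to caption-only coordinates.

    Hypothetical helper: assumes each entry carries 'span_start'/'span_end' built the way
    your current loop builds them, i.e. starting from the decoded prompt length.
    """
    offset = len(prompt_text)
    rebased = []
    for entry in token_trace:
        new_entry = dict(entry)
        new_entry["span_start"] = entry["span_start"] - offset
        new_entry["span_end"] = entry["span_end"] - offset
        rebased.append(new_entry)
    return rebased

(The first option needs no helper at all: initializing current_prefix = "" makes every span caption-relative from the start.)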

Also: use clean_up_tokenization_spaces=False consistently when decoding step-by-step (details below).

Hugging Face’s LLaVA docs explicitly call out formatting and tokenization behaviors that can bite you here (prompt formatting + token expansion + text truncation). (Hugging Face)


B. You’re treating all words as “object words” (massively inflating “Hits”)

You do:

object_words = set([w.lower() for w in res["words"]])  # "All objects detected"

But in your CHAIR code, res["words"] is raw_words (actually double_words), i.e. the full tokenized caption, including function words like “a”, “in”, “with”, “the”, etc. (not just MSCOCO objects).

So you create spans for essentially every word in the caption, then label as HALL only if it’s in hall_words. This guarantees:

  • a huge number of spans get labeled HIT by default
  • hallucinations become a tiny fraction of “all matched word tokens”

That single line can easily push you from “~5–15% CHAIR_i” intuition to “~0.8% token hallucination” in your own aggregation.

Fix: build spans only for MSCOCO object words, not raw caption words.

CHAIR gives you exactly what you need:

  • mscoco_hallucinated_words
  • (implicitly) the object-only tokenization via caption_to_words

The original CHAIR paper defines the metric specifically over the MSCOCO object set + synonyms, not all caption tokens. (ACL Anthology)
And newer work (e.g., ALOHa) explicitly notes CHAIR’s “fixed object set” limitation—so it’s important you’re truly using the object-only subset when doing token-level analyses. (arXiv)


2) A robust alignment recipe that stays faithful to CHAIR

Step 1 — reproduce CHAIR tokenization for the caption

Instead of using res["words"], re-run:

words, node_words, idxs, raw_words = chair_evaluator.caption_to_words(caption)
# words/node_words are COCO-object-only lists
# idxs are the positions of those object words inside raw_words
# raw_words is the full “double_words” tokenization
hall_idx_set = set(res["hallucination_idxs"])   # indices into raw_words
obj_idx_set  = set(idxs)                        # indices into raw_words for COCO objects

Now you have:

  • which raw word positions are COCO objects
  • which raw word positions are hallucinations

No synonym/lemma drift. No “all words are objects” mistake.

Step 2 — map raw_words (CHAIR tokens) to character spans in the caption

Do a left-to-right scan (not set() + find()), preserving duplicates and order.
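
A sketch of that scan (the helper name is mine; note that CHAIR singularizes words, so a raw word may not appear verbatim in the caption and should then be skipped or fuzzy-matched):

def raw_words_to_char_spans(caption, raw_words):
    """Left-to-right scan mapping CHAIR's raw_words to character spans in the caption.

    Duplicates keep their order. Returns None for a word that cannot be found verbatim
    (e.g. because CHAIR singularized it), so the caller can decide how to handle it.
    """
    spans = []
    cursor = 0
    lowered = caption.lower()
    for word in raw_words:
        idx = lowered.find(word, cursor)
        if idx == -1:
            spans.append(None)
            continue
        spans.append((idx, idx + len(word)))
        cursor = idx + len(word)
    return spans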

Step 3 — map generated BPE token spans to those character spans

Critical: ensure the token-by-token text you append matches the final decoded caption string.

Use consistent decoding settings. HF examples frequently recommend clean_up_tokenization_spaces=False to avoid silent whitespace/punctuation normalization differences when aligning. (Hugging Face)

So during decoding, do:

token_text = processor.decode([token_id],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)

and also decode the full caption with the same cleanup setting.
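
It is also worth checking, once per caption, that the two decodes agree, since per-piece decoding can differ from a single full decode (whitespace handling is the usual culprit). A hypothetical helper:

def check_trace_consistency(token_trace, full_caption):
    """Return True if concatenating the per-step decoded pieces reproduces the full caption.

    If this fails, character-span alignment will silently drift; revisit the
    skip_special_tokens / clean_up_tokenization_spaces settings on both decodes.
    """
    reconstructed = "".join(entry["token"] for entry in token_trace)
    return reconstructed == full_caption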


3) Your attention extraction: mostly reasonable, but there are 3 important caveats

A. The “sum over visual tokens” proxy is used in the literature (with nuances)

In LLaVA-style models, image features are inserted as a block of “image tokens” and the LLM uses self-attention over both text tokens and those visual tokens. Summing attention mass over the image-token range is a common proxy for “how much this step attends to image features”.

But strong hallucination mitigation papers typically do not just average all heads and all mid-layers—they often:

  • identify “image heads” (heads strongly attending to image tokens) and operate on them
  • suppress / mask selected heads
  • or use contrastive decoding variants guided by attention structure

Examples:

  • OPERA (decoding-time mitigation) (arXiv)
  • SPIN (image-guided head suppression; explicitly runs CHAIR eval and intervenes by layer/head ranges) (GitHub)
  • MaskCD (selects “image heads” and studies CHAIR sensitivity to that selection; shows head selection matters a lot)

Implication for your correlation test: averaging across heads/layers can wash out the signal if only a subset of heads carry grounding.
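
A minimal per-head variant of your mass computation, so you can rank heads instead of averaging them away (the layer range and any selection threshold you pick are assumptions to tune, not values taken from those papers):

def per_head_visual_mass(attentions, start_idx, end_idx, layers=range(10, 26)):
    """Return {(layer, head): visual attention mass} for the last query position.

    `attentions` is the tuple returned with output_attentions=True; each element
    has shape (batch, heads, q_len, kv_len). Sketch only.
    """
    masses = {}
    for layer in layers:
        if layer >= len(attentions):
            break
        attn = attentions[layer][0, :, -1, :]               # (heads, kv_len)
        if end_idx > attn.shape[-1]:
            continue
        head_mass = attn[:, start_idx:end_idx].sum(dim=-1)  # (heads,)
        for head, mass in enumerate(head_mass.tolist()):
            masses[(layer, head)] = mass
    return masses

Averaging the Hit/Hall gap per (layer, head) over a small calibration set should tell you quickly whether a handful of heads carry most of the grounding signal.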

B. Weak correlation is plausible even if your attention code is “correct”

Even when a token is visually grounded, the model may:

  • attend to earlier text tokens that already absorbed visual evidence in previous layers/steps
  • use MLP-stored visual information rather than direct attention to image tokens at that step
  • distribute attention broadly across linguistic context (especially for function words around object mentions)

So “visual attention mass at token t” can be a noisy grounding proxy unless you:

  • restrict to object tokens (after fixing alignment)
  • focus on “image heads” (Ă  la MaskCD/SPIN)
  • examine per-layer or per-head effects instead of a grand mean

C. There is a real Transformers-specific pitfall with attn_implementation="eager" on LLaVA

There is a closed Transformers bug report specifically about LLaVA-1.5-7B + attn_implementation="eager" producing incorrect token IDs / generation corruption for some images/prompts. (GitHub)

If your generations are even occasionally corrupted under eager attention, both:

  • CHAIR results
  • attention–hallucination correlations

can become meaningless.

Sanity check (high value):

  • run the same prompts/images with default attention (no eager) and compare generated captions (see the sketch right after this list)
  • if captions differ materially, your “attention run” is not a faithful baseline
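
A rough way to run that comparison on a handful of images (sketch; "sdpa" is assumed to be the non-eager default in your Transformers version, and the image path is a placeholder):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("path/to/a/coco_image.jpg").convert("RGB")  # placeholder path
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)

captions = {}
for impl in ("eager", "sdpa"):
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, attn_implementation=impl
    ).to("cuda")
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    captions[impl] = processor.decode(out[0], skip_special_tokens=True)
    del model
    torch.cuda.empty_cache()

# If the two captions differ materially across many images, be suspicious of the
# eager-attention traces before interpreting any correlation built on them.
print(captions["eager"] == captions["sdpa"])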

Also, HF warns that LLaVA processors should carry patch_size, num_additional_image_tokens, and vision_feature_select_strategy to correctly infer/expand image tokens and avoid embedding merge failures/truncation issues. (Hugging Face)


4) Why your hallucination rate can also be genuinely lower than you expect (even after fixes)

Even with correct CHAIR, your setup tends to reduce hallucinations:

  1. Greedy decoding (argmax) usually hallucinates less than sampling.
  2. Short generation: you cap at 80 tokens. There is evidence (for LLaVA-1.5-7B) that CHAIR hallucination increases with longer max_new_tokens. (arXiv)
  3. Only 100 images: the variance of CHAIR on small subsets can be large.

So: don’t use the “5–15% CHAIR_i” intuition unless you match the decoding configuration used in those baselines.
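
If you do want to compare against a published CHAIR number, pin down and report the same decoding setup; a hedged sketch (the values below are placeholders to copy from whichever baseline you use):

# Placeholder evaluation config: align these with the baseline you compare against,
# otherwise CHAIR_s / CHAIR_i numbers are not comparable.
EVAL_CONFIG = {
    "decoding": {"do_sample": False, "num_beams": 1},  # greedy tends to hallucinate less than sampling
    "max_new_tokens": 512,   # longer generations tend to raise CHAIR for LLaVA-1.5
    "num_images": 500,       # 100 images gives a high-variance estimate
    "prompt": "USER: <image>\nDescribe this image in detail. ASSISTANT:",
}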


5) Similar cases / issues online that match your failure modes

Eager attention causing odd outputs on LLaVA

  • Transformers Issue #35270: LLaVA v1.5 7B with attn_implementation="eager" can yield incorrect token IDs (including token id 0) and unstable behavior for some images/prompts. (GitHub)

Image token expansion / processor config pitfalls

  • HF docs warn that processors missing patch_size, num_additional_image_tokens, and vision_feature_select_strategy can cause incorrect <image> expansion / truncation / embedding merge failures. (Hugging Face)
  • There are also user reports about mismatch between image features and number of image tokens in LLaVA(-family) models. (GitHub)

Token index / special token mismatches

  • Reports of confusion around image_token_index / special token IDs across variants. (GitHub)

6) High-quality resources (papers, codebases, eval scripts) you can copy patterns from

Metrics / evaluation

  • CHAIR (Rohrbach et al., 2018) – defines object hallucination precisely using segmentation + captions. (ACL Anthology)
  • ALOHa (2024) – modernizes hallucination evaluation beyond CHAIR’s fixed object vocabulary. (arXiv)

Hallucination mitigation methods with code and CHAIR scripts

  • SPIN (EMNLP 2025) – image-guided head suppression; includes CHAIR eval commands for LLaVA-1.5. (GitHub)
  • ConVis (AAAI 2025) – contrastive decoding with hallucination visualization; provides CHAIR conversion/eval scripts. (GitHub)
  • MaskCD (Findings EMNLP 2025) – shows “image head” selection is critical; includes CHAIR ablations on LLaVA-1.5-7B.
  • CCA-LLaVA (NeurIPS 2024) – focuses on causal attention for object hallucination; references OPERA/CHAIR eval tooling. (GitHub)
  • AllPath (NeurIPS 2025) – unified mitigation across alignment formats; includes CHAIR evaluation. (GitHub)
  • PerturboLLaVA – explicitly acknowledges OPERA-based CHAIR evaluation code usage. (GitHub)
  • ReVisiT – provides CHAIR evaluation scripts and pins Transformers versions per model (useful for reproducibility). (GitHub)

Bottom line for your case

  1. Your token–word alignment is currently not measuring “object tokens” in the CHAIR sense, and it is very likely inflating Hits while missing many hallucinated object mentions.

  2. Your attention mass computation is a defensible proxy, but the way you aggregate (all heads + many layers) is exactly where multiple recent papers show the signal gets diluted; try per-head “image head” selection (MaskCD/SPIN-style) once alignment is fixed.

  3. Independently, verify that attn_implementation="eager" is not corrupting outputs on your environment—there is direct precedent for LLaVA-1.5-7B behaving oddly under eager attention. (GitHub)

If you want a minimal diagnostic that usually identifies the primary culprit in minutes: compare chair_results["overall_metrics"]["CHAIRi"] to your own len(halls)/(len(hits)+len(halls)). If they differ by a lot, the bug is almost entirely in alignment/label mapping (not the model).
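
Wrapped as a tiny helper (names follow your script; the function itself is hypothetical):

def compare_rates(chair_results, hits_mass, halls_mass):
    """If the word-level CHAIR_i and your token-level rate diverge wildly,
    the problem is in alignment/label mapping, not in the model or CHAIR."""
    chair_i = chair_results["overall_metrics"]["CHAIRi"]
    token_rate = len(halls_mass) / max(len(hits_mass) + len(halls_mass), 1)
    print(f"CHAIR_i (word level):      {chair_i:.3%}")
    print(f"Token-level hallucination: {token_rate:.3%}")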

You are the GOAT my friend!
