Context & Objective

I am running an experiment to validate the “Attention Gap” hypothesis for Hallucination Reduction.

- **Hypothesis:** Hallucinated tokens (objects generated but not present in the image) should have statistically lower Attention Mass on the visual tokens than “Hit” tokens (correct objects).
- **Goal:** Establish a baseline point-biserial correlation between Visual Attention Mass and Hallucination Status (0/1) before applying a steering intervention.
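For reference, the baseline statistic is just `scipy.stats.pointbiserialr` applied to the per-token visual attention masses against a 0/1 hallucination label. A minimal sketch with toy numbers (the arrays below are purely illustrative):

```python
import numpy as np
from scipy import stats

# Toy numbers, purely illustrative of the statistic being computed.
hit_mass = np.array([0.09, 0.11, 0.08])   # visual attention mass on correct ("Hit") object tokens
hall_mass = np.array([0.07, 0.06])        # visual attention mass on hallucinated object tokens

mass = np.concatenate([hit_mass, hall_mass])
labels = np.concatenate([np.zeros(len(hit_mass)), np.ones(len(hall_mass))])  # 0 = Hit, 1 = Hallucination

# pointbiserialr takes the dichotomous variable first, the continuous variable second.
r, p = stats.pointbiserialr(labels, mass)
print(f"point-biserial r = {r:.4f}, p = {p:.3e}")
```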
Experimental Setup

- **Model:** `llava-hf/llava-1.5-7b-hf` (loaded in `float16`).
- **Attention Implementation:** Forced `attn_implementation="eager"` to support `output_attentions=True`.
- **Dataset:** MSCOCO 2014 Validation Set (standard split for the CHAIR metric).
- **Metric:** CHAIR (Caption Hallucination Assessment with Image Relevance), adapted to map individual generated tokens to “Hit” or “Hallucination” spans.
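For completeness, this is roughly how the model is loaded and how I sanity-check that the eager implementation actually returns attention weights (a minimal sketch; the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # non-eager kernels do not return per-head attention weights
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.inference_mode():
    out = model(
        input_ids=inputs.input_ids.to(model.device),
        pixel_values=inputs.pixel_values.to(model.device, torch.float16),
        output_attentions=True,
    )

# Expect one tensor per decoder layer, each shaped (batch, heads, query_len, key_len).
print(len(out.attentions), out.attentions[0].shape)
```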
The Implementation Logic

I am manually decoding token-by-token. At each step t:

- **Visual Slice:** I locate the `<image>` tokens in the input. For LLaVA-1.5, I confirm exactly 576 visual tokens are present.
- **Attention Mass Calculation** (written out as a formula below):
  - I extract `outputs.attentions` (a tuple of 32 layers).
  - I select layers 10–25 (mid-layers, where visual-semantic alignment typically happens).
  - I sum the attention weights over the 576 visual tokens for the last generated token q_t.
  - I mean-pool across the selected layers and heads.
- **Alignment:** I use the CHAIR metric to identify which words in the caption are hallucinations, then map the generated tokens to those words using character-span alignment.
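Written out, the per-token quantity I log is (my own notation: $s$ and $e$ are the first and last-plus-one visual token positions, $L = \{10, \dots, 25\}$ the selected layers, $H = 32$ the number of heads, and $A^{(\ell,h)}_{t,j}$ the softmax attention weight from the current query token $t$ to key position $j$):

$$
M_t = \frac{1}{|L|} \sum_{\ell \in L} \frac{1}{H} \sum_{h=1}^{H} \sum_{j=s}^{e-1} A^{(\ell,h)}_{t,j}
$$

i.e., sum over the 576 visual keys, then mean-pool over heads and over layers 10–25.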
The Problem / Anomalies

After analyzing ~7,200 object tokens, my results are statistically suspicious:

- **Extremely Low Hallucination Rate (0.8%):**
  - Result: Out of 7,218 object tokens, only 62 were classified as hallucinations.
  - Expectation: Standard LLaVA-1.5 usually scores a CHAIR$_i$ of 5%–15% on COCO. A 0.8% rate implies either that the model is performing impossibly well, or that my evaluation is missing nearly all hallucinations.
- **Weak Significance:**
  - Result: Mean Mass (Hit): 0.0912 vs. Mean Mass (Hall): 0.0819.
  - Correlation: r = -0.0187 (p = 0.11).
  - While the “gap” exists in the expected direction (Hits have higher mass), the result is not statistically significant, likely because of the extreme class imbalance (only 62 samples in the hallucination class).
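On the weak-significance point, a rank-based comparison plus an effect size is less sensitive to the class imbalance and the skewed mass distributions than the point-biserial r. A minimal sketch, reusing the same `hits`/`halls` lists my `analyze` method builds (included only to frame the question):

```python
import numpy as np
from scipy import stats

def imbalance_robust_check(hits, halls):
    """Rank-based complement to the point-biserial correlation."""
    hits, halls = np.asarray(hits), np.asarray(halls)

    # Mann-Whitney U, one-sided: are Hit masses stochastically larger than Hallucination masses?
    u, p = stats.mannwhitneyu(hits, halls, alternative="greater")

    # Common-language effect size: approx. P(random Hit mass > random Hallucination mass).
    cles = u / (len(hits) * len(halls))

    print(f"Mann-Whitney U p-value: {p:.4e}")
    print(f"P(hit mass > hall mass): {cles:.3f}")
```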
My Code Implementation
Below is the core logic for extracting attention and aligning tokens. I suspect I might be calculating the mass incorrectly or missing a subtlety in how eager attention handles the KV cache for visual tokens.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
import os
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from tqdm import tqdm
from PIL import Image
import sys
import re
# --- Setup Paths ---
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(PROJECT_DIR)
from data_setup import load_llava_model
from utils.chair import CHAIR
class Experiment1_Fixed:
def __init__(self, model, processor, coco_img_dir, output_dir="results_exp1_fixed"):
self.model = model
self.processor = processor
self.coco_img_dir = coco_img_dir
self.output_dir = output_dir
os.makedirs(output_dir, exist_ok=True)
if hasattr(self.model.config, "image_token_index"):
self.image_token_id = self.model.config.image_token_index
else:
self.image_token_id = 32000
self._printed_debug = False
def compute_attention_mass(self, attentions, start_idx, end_idx):
"""
Sums attention mass over the visual tokens (start_idx to end_idx).
Averages across layers 10-25 (Deep visual semantic layers).
"""
# LLaVA-7B Visual Layers (Middle of the network)
target_layers = range(10, 26)
total_mass = 0.0
valid_layers = 0
for i in target_layers:
if i >= len(attentions): break
# Shape: (batch, heads, 1, seq_len) -> (heads, seq_len)
attn_matrix = attentions[i][0, :, -1, :]
# Safety check
if end_idx > attn_matrix.shape[-1]: continue
# We want total attention paid to the image
layer_mass = attn_matrix[:, start_idx:end_idx].sum(dim=-1).mean().item()
total_mass += layer_mass
valid_layers += 1
if valid_layers == 0: return 0.0
return total_mass / valid_layers
@torch.inference_mode()
def run_inference(self, image_files, num_images=100):
traces = []
print(f"Running inference on {num_images} images...")
for img_filename in tqdm(image_files[:num_images]):
image_path = os.path.join(self.coco_img_dir, img_filename)
try:
image = Image.open(image_path).convert("RGB")
except:
continue
# Parse Image ID
try:
image_id = int(img_filename.split('_')[-1].split('.')[0])
except:
continue
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
# Processor automatically expands <image> into 576 tokens in input_ids
inputs = self.processor(text=prompt, images=image, return_tensors="pt")
input_ids = inputs.input_ids.to(self.model.device)
pixel_values = inputs.pixel_values.to(self.model.device, dtype=torch.float16)
# --- PREFILL ---
prefill_out = self.model(
input_ids=input_ids,
pixel_values=pixel_values,
use_cache=True,
return_dict=True
)
past_key_values = prefill_out.past_key_values
# Find all indices where input is <image> (32000)
image_indices = (input_ids[0] == self.image_token_id).nonzero(as_tuple=True)[0]
if len(image_indices) == 0:
continue
# The start is the first occurrence
start_idx = image_indices[0].item()
# The end is the last occurrence + 1
end_idx = image_indices[-1].item() + 1
# SANITY CHECK: LLaVA-1.5 should have exactly 576 visual tokens
visual_count = end_idx - start_idx
if visual_count != 576:
print(f"\n[WARNING] Found {visual_count} visual tokens instead of 576. Skipping.")
continue
if not self._printed_debug:
print(f"\n--- DEBUG CONFIRMED ---")
print(f"Image Tokens Start: {start_idx}")
print(f"Image Tokens End: {end_idx}")
print(f"Total Visual Mass Range: {visual_count} tokens")
print(f"-----------------------\n")
self._printed_debug = True
# --- DECODING LOOP ---
next_token = torch.argmax(prefill_out.logits[:, -1, :], dim=-1).unsqueeze(1)
generated_ids = []
token_trace = []
# Using 'full_prefix' logic for alignment
current_prefix = self.processor.decode(input_ids[0], skip_special_tokens=True)
for _ in range(80):
outputs = self.model(
input_ids=next_token,
past_key_values=past_key_values,
use_cache=True,
output_attentions=True
)
# Calculate Mass
mass = self.compute_attention_mass(outputs.attentions, start_idx, end_idx)
token_id = next_token.item()
generated_ids.append(token_id)
# We track the text growing step by step
token_text = self.processor.decode([token_id], skip_special_tokens=True)
# Update prefix for next step alignment
trace_entry = {
"token": token_text,
"mass": mass,
"span_start": len(current_prefix),
"span_end": len(current_prefix) + len(token_text)
}
current_prefix += token_text
token_trace.append(trace_entry)
past_key_values = outputs.past_key_values
next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1).unsqueeze(1)
if next_token.item() == self.processor.tokenizer.eos_token_id:
break
full_caption = self.processor.decode(generated_ids, skip_special_tokens=True)
traces.append({
"image_id": image_id,
"caption": full_caption,
"trace": token_trace
})
return traces
def align_and_evaluate(self, traces, chair_evaluator):
print("\nRunning CHAIR evaluation...")
# Save temp file for CHAIR
chair_input = [{"image_id": t["image_id"], "caption": t["caption"]} for t in traces]
temp_file = os.path.join(self.output_dir, "temp_caps.json")
with open(temp_file, "w") as f:
json.dump(chair_input, f)
# Run CHAIR
chair_results = chair_evaluator.compute_chair(temp_file, "image_id", "caption")
print("CHAIR metrics:", chair_results.get("overall_metrics", {}))
hits_mass = []
halls_mass = []
for i, res in enumerate(chair_results["sentences"]):
caption = traces[i]["caption"]
trace = traces[i]["trace"]
# Get Raw Words
hall_words = set([h[0].lower() for h in res["mscoco_hallucinated_words"]])
object_words = set([w.lower() for w in res["words"]]) # All objects detected
# Find Spans in Caption
word_spans = []
for word in object_words:
# Use simple find for robustness or regex
start = 0
while True:
idx = caption.lower().find(word, start)
if idx == -1: break
word_spans.append({
"start": idx,
"end": idx + len(word),
"type": "HALL" if word in hall_words else "HIT"
})
start = idx + len(word)
# Map tokens to spans
for token in trace:
if not token["token"].strip(): continue
t_start = token["span_start"]
t_end = token["span_end"]
t_mass = token["mass"]
# Check overlaps
for span in word_spans:
# Logic: Do the token and the word overlap significantly?
overlap_start = max(t_start, span["start"])
overlap_end = min(t_end, span["end"])
if overlap_start < overlap_end:
if span["type"] == "HALL":
halls_mass.append(t_mass)
else:
hits_mass.append(t_mass)
# Break ensures we don't double count one token for multiple overlaps
# (though rare with non-overlapping objects)
break
return hits_mass, halls_mass
def analyze(self, hits, halls):
hits = np.array(hits)
halls = np.array(halls)
print("\n--- FINAL CORRECTED RESULTS ---")
print(f"Total tokens analyzed: {len(hits) + len(halls)}")
print(f"Hits: {len(hits)}")
print(f"Hallucinations: {len(halls)}")
if len(halls) < 5 or len(hits) < 5:
print("Insufficient data for statistics.")
return
print(f"Mean Mass (Hit): {hits.mean():.4f}")
print(f"Mean Mass (Hall): {halls.mean():.4f}")
# Point Biserial
y = np.concatenate([np.zeros(len(hits)), np.ones(len(halls))])
x = np.concatenate([hits, halls])
corr, p = stats.pointbiserialr(y, x)
print(f"Correlation: {corr:.4f}")
print(f"P-value: {p:.4e}")
# Plot
plt.figure(figsize=(10, 6))
sns.kdeplot(hits, label=f"Hits (Mean: {hits.mean():.2f})", fill=True, color='blue')
sns.kdeplot(halls, label=f"Hallucinations (Mean: {halls.mean():.2f})", fill=True, color='red')
plt.title(f"Experiment 1: Attention Gap (Corr: {corr:.3f})")
plt.xlabel("Attention Mass (Sum over 576 Visual Tokens)")
plt.legend()
plt.savefig(os.path.join(self.output_dir, "exp1_final_plot.png"))
print(f"Plot saved to {self.output_dir}")
if __name__ == "__main__":
# Load Model
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
attn_implementation="eager",
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)
# Check paths
anno_dir = os.path.join(os.getcwd(), 'data/mscoco/annotations')
img_dir = os.path.join(os.getcwd(), 'data/mscoco/val2014')
if not os.path.exists(anno_dir) or not os.path.exists(img_dir):
print("Error: Data not found. Please run data_setup.py")
sys.exit(1)
evaluator = CHAIR(anno_dir)
exp = Experiment1_Fixed(model, processor, img_dir)
# Get images
all_images = sorted([f for f in os.listdir(img_dir) if f.endswith(".jpg")])
# Run
traces = exp.run_inference(all_images, num_images=100)
hits, halls = exp.align_and_evaluate(traces, evaluator)
exp.analyze(hits, halls)
`utils/chair.py` (the CHAIR evaluator, included for reference):
'''
Copied from: https://github.com/LisaAnne/Hallucination/blob/master/utils/chair.py
Modified by: Maxlinn
1. adapt calculation of CHAIR-i and CHAIR-s for Python3, supports for both json and jsonl file input.
2. integrate synonyms.txt to make the script standalone.
3. remove machine-translation based metrics BLEU-n, CIDEr, ROGUE
4. add new metric Recall, which represents the node words(i.e. lemmas of objects) coverage overall.
5. add pickle cache mechanism to make it fast for repetitive evaluations.
'''
import os
import sys
import nltk
import json
from pattern.en import singularize
import argparse
import tqdm
import pickle
from collections import defaultdict
# copied from: https://github.com/LisaAnne/Hallucination/blob/master/data/synonyms.txt
synonyms_txt = '''
person, girl, boy, man, woman, kid, child, chef, baker, people, adult, rider, children, baby, worker, passenger, sister, biker, policeman, cop, officer, lady, cowboy, bride, groom, male, female, guy, traveler, mother, father, gentleman, pitcher, player, skier, snowboarder, skater, skateboarder, person, woman, guy, foreigner, child, gentleman, caller, offender, coworker, trespasser, patient, politician, soldier, grandchild, serviceman, walker, drinker, doctor, bicyclist, thief, buyer, teenager, student, camper, driver, solider, hunter, shopper, villager
bicycle, bike, bicycle, bike, unicycle, minibike, trike
car, automobile, van, minivan, sedan, suv, hatchback, cab, jeep, coupe, taxicab, limo, taxi
motorcycle, scooter, motor bike, motor cycle, motorbike, scooter, moped
airplane, jetliner, plane, air plane, monoplane, aircraft, jet, jetliner, airbus, biplane, seaplane
bus, minibus, trolley
train, locomotive, tramway, caboose
truck, pickup, lorry, hauler, firetruck
boat, ship, liner, sailboat, motorboat, dinghy, powerboat, speedboat, canoe, skiff, yacht, kayak, catamaran, pontoon, houseboat, vessel, rowboat, trawler, ferryboat, watercraft, tugboat, schooner, barge, ferry, sailboard, paddleboat, lifeboat, freighter, steamboat, riverboat, battleship, steamship
traffic light, street light, traffic signal, stop light, streetlight, stoplight
fire hydrant, hydrant
stop sign
parking meter
bench, pew
bird, ostrich, owl, seagull, goose, duck, parakeet, falcon, robin, pelican, waterfowl, heron, hummingbird, mallard, finch, pigeon, sparrow, seabird, osprey, blackbird, fowl, shorebird, woodpecker, egret, chickadee, quail, bluebird, kingfisher, buzzard, willet, gull, swan, bluejay, flamingo, cormorant, parrot, loon, gosling, waterbird, pheasant, rooster, sandpiper, crow, raven, turkey, oriole, cowbird, warbler, magpie, peacock, cockatiel, lorikeet, puffin, vulture, condor, macaw, peafowl, cockatoo, songbird
cat, kitten, feline, tabby
dog, puppy, beagle, pup, chihuahua, schnauzer, dachshund, rottweiler, canine, pitbull, collie, pug, terrier, poodle, labrador, doggie, doberman, mutt, doggy, spaniel, bulldog, sheepdog, weimaraner, corgi, cocker, greyhound, retriever, brindle, hound, whippet, husky
horse, colt, pony, racehorse, stallion, equine, mare, foal, palomino, mustang, clydesdale, bronc, bronco
sheep, lamb, ram, lamb, goat, ewe
cow, cattle, oxen, ox, calf, cattle, holstein, heifer, buffalo, bull, zebu, bison
elephant
bear, panda
zebra
giraffe
backpack, knapsack
umbrella
handbag, wallet, purse, briefcase
tie, bow, bow tie
suitcase, suit case, luggage
frisbee
skis, ski
snowboard
sports ball, ball
kite
baseball bat
baseball glove
skateboard
surfboard, longboard, skimboard, shortboard, wakeboard
tennis racket, racket
bottle
wine glass
cup
fork
knife, pocketknife, knive
spoon
bowl, container
banana
apple
sandwich, burger, sub, cheeseburger, hamburger
orange
broccoli
carrot
hot dog
pizza
donut, doughnut, bagel
cake, cheesecake, cupcake, shortcake, coffeecake, pancake
chair, seat, stool
couch, sofa, recliner, futon, loveseat, settee, chesterfield
potted plant, houseplant
bed
dining table, table, desk
toilet, urinal, commode, toilet, lavatory, potty
tv, monitor, televison, television
laptop, computer, notebook, netbook, lenovo, macbook, laptop computer
mouse
remote
keyboard
cell phone, mobile phone, phone, cellphone, telephone, phon, smartphone, iPhone
microwave
oven, stovetop, stove, stove top oven
toaster
sink
refrigerator, fridge, fridge, freezer
book
clock
vase
scissors
teddy bear, teddybear
hair drier, hairdryer
toothbrush
'''
def combine_coco_captions(annotation_path):
if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'val')):
raise Exception("Please download MSCOCO caption annotations for val set")
if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'train')):
raise Exception("Please download MSCOCO caption annotations for train set")
val_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'val')))
train_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'train')))
all_caps = {'info': train_caps['info'],
'licenses': train_caps['licenses'],
'images': val_caps['images'] + train_caps['images'],
'annotations': val_caps['annotations'] + train_caps['annotations']}
return all_caps
def combine_coco_instances(annotation_path):
if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'val')):
raise Exception("Please download MSCOCO instance annotations for val set")
if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'train')):
raise Exception("Please download MSCOCO instance annotations for train set")
val_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'val')))
train_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'train')))
all_instances = {'info': train_instances['info'],
'licenses': train_instances['licenses'],
'type': train_instances['licenses'],
'categories': train_instances['categories'],
'images': train_instances['images'] + val_instances['images'],
'annotations': val_instances['annotations'] + train_instances['annotations']}
return all_instances
class CHAIR(object):
def __init__(self, coco_path):
self.imid_to_objects = defaultdict(list) # later become a dict of sets
self.coco_path = coco_path
#read in synonyms
synonyms = synonyms_txt.splitlines()
synonyms = [s.strip().split(', ') for s in synonyms]
self.mscoco_objects = [] #mscoco objects and *all* synonyms
self.inverse_synonym_dict = {}
for synonym in synonyms:
self.mscoco_objects.extend(synonym)
for s in synonym:
self.inverse_synonym_dict[s] = synonym[0]
#Some hard coded rules for implementing CHAIR metrics on MSCOCO
#common 'double words' in MSCOCO that should be treated as a single word
coco_double_words = ['motor bike', 'motor cycle', 'air plane', 'traffic light', 'street light', 'traffic signal', 'stop light', 'fire hydrant', 'stop sign', 'parking meter', 'suit case', 'sports ball', 'baseball bat', 'baseball glove', 'tennis racket', 'wine glass', 'hot dog', 'cell phone', 'mobile phone', 'teddy bear', 'hair drier', 'potted plant', 'bow tie', 'laptop computer', 'stove top oven', 'hot dog', 'teddy bear', 'home plate', 'train track']
#Hard code some rules for special cases in MSCOCO
#qualifiers like 'baby' or 'adult' animal will lead to a false fire for the MSCOCO object 'person'. 'baby bird' --> 'bird'.
animal_words = ['bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'animal', 'cub']
#qualifiers like 'passenger' vehicle will lead to a false fire for the MSCOCO object 'person'. 'passenger jet' --> 'jet'.
vehicle_words = ['jet', 'train']
#double_word_dict will map double words to the word they should be treated as in our analysis
self.double_word_dict = {}
for double_word in coco_double_words:
self.double_word_dict[double_word] = double_word
for animal_word in animal_words:
self.double_word_dict['baby %s' %animal_word] = animal_word
self.double_word_dict['adult %s' %animal_word] = animal_word
for vehicle_word in vehicle_words:
self.double_word_dict['passenger %s' %vehicle_word] = vehicle_word
self.double_word_dict['bow tie'] = 'tie'
self.double_word_dict['toilet seat'] = 'toilet'
self.double_word_dict['wine glas'] = 'wine glass'
self.get_annotations()
def _load_generated_captions_into_evaluator(self, cap_file, image_id_key, caption_key):
'''
Meant to save time so imid_to_objects does not always need to be recomputed.
'''
#Read in captions
self.caps, self.eval_imids = load_generated_captions(cap_file, image_id_key, caption_key)
assert len(self.caps) == len(self.eval_imids)
def caption_to_words(self, caption):
'''
Input: caption
Output: MSCOCO words in the caption
'''
#standard preprocessing
words = nltk.word_tokenize(caption.lower())
words = [singularize(w) for w in words]
#replace double words
i = 0
double_words = []
idxs = []
while i < len(words):
idxs.append(i)
double_word = ' '.join(words[i:i+2])
if double_word in self.double_word_dict:
double_words.append(self.double_word_dict[double_word])
i += 2
else:
double_words.append(words[i])
i += 1
words = double_words
#toilet seat is not chair (sentences like "the seat of the toilet" will fire for "chair" if we do not include this line)
if ('toilet' in words) & ('seat' in words): words = [word for word in words if word != 'seat']
#get synonyms for all words in the caption
idxs = [idxs[idx] for idx, word in enumerate(words) \
if word in set(self.mscoco_objects)]
words = [word for word in words if word in set(self.mscoco_objects)]
node_words = []
for word in words:
node_words.append(self.inverse_synonym_dict[word])
#return all the MSCOCO objects in the caption
return words, node_words, idxs, double_words
def get_annotations_from_segments(self):
'''
Add objects taken from MSCOCO segmentation masks
'''
coco_segments = combine_coco_instances(self.coco_path )
segment_annotations = coco_segments['annotations']
#make dict linking object name to ids
id_to_name = {} #dict with id to synsets
for cat in coco_segments['categories']:
id_to_name[cat['id']] = cat['name']
for i, annotation in enumerate(segment_annotations):
sys.stdout.write("\rGetting annotations for %d/%d segmentation masks"
%(i, len(segment_annotations)))
imid = annotation['image_id']
node_word = self.inverse_synonym_dict[id_to_name[annotation['category_id']]]
self.imid_to_objects[imid].append(node_word)
print("\n")
def get_annotations_from_captions(self):
'''
Add objects taken from MSCOCO ground truth captions
'''
coco_caps = combine_coco_captions(self.coco_path)
caption_annotations = coco_caps['annotations']
for i, annotation in enumerate(caption_annotations):
sys.stdout.write('\rGetting annotations for %d/%d ground truth captions'
%(i, len(coco_caps['annotations'])))
imid = annotation['image_id']
_, node_words, _, _ = self.caption_to_words(annotation['caption'])
# note here is update, so call get_annotations_from_segments first
self.imid_to_objects[imid].extend(node_words)
print("\n")
def get_annotations(self):
'''
Get annotations from both segmentation and captions. Need both annotation types for CHAIR metric.
'''
self.get_annotations_from_segments()
self.get_annotations_from_captions()
# deduplicate
for imid in self.imid_to_objects:
self.imid_to_objects[imid] = set(self.imid_to_objects[imid])
def compute_chair(self, cap_file, image_id_key, caption_key):
'''
Given ground truth objects and generated captions, determine which sentences have hallucinated words.
'''
self._load_generated_captions_into_evaluator(cap_file, image_id_key, caption_key)
imid_to_objects = self.imid_to_objects
caps = self.caps
eval_imids = self.eval_imids
num_caps = 0.
num_hallucinated_caps = 0.
hallucinated_word_count = 0.
coco_word_count = 0.
# :add:
num_recall_gt_objects = 0.
num_gt_objects = 0.
output = {'sentences': []}
for i in tqdm.trange(len(caps)):
cap :str = caps[i]
imid :int = eval_imids[i]
#get all words in the caption, as well as corresponding node word
words, node_words, idxs, raw_words = self.caption_to_words(cap)
gt_objects = imid_to_objects[imid]
cap_dict = {'image_id': imid,
'caption': cap,
'mscoco_hallucinated_words': [],
'mscoco_gt_words': list(gt_objects),
'mscoco_generated_words': list(node_words),
'hallucination_idxs': [],
'words': raw_words
}
# :add:
cap_dict['metrics'] = {'CHAIRs': 0,
'CHAIRi': 0,
'Recall': 0}
#count hallucinated words
coco_word_count += len(node_words)
hallucinated = False
# add
recall_gt_objects = set()
for word, node_word, idx in zip(words, node_words, idxs):
if node_word not in gt_objects:
hallucinated_word_count += 1
cap_dict['mscoco_hallucinated_words'].append((word, node_word))
cap_dict['hallucination_idxs'].append(idx)
hallucinated = True
else:
recall_gt_objects.add(node_word)
#count hallucinated caps
num_caps += 1
if hallucinated:
num_hallucinated_caps += 1
# add
num_gt_objects += len(gt_objects)
num_recall_gt_objects += len(recall_gt_objects)
cap_dict['metrics']['CHAIRs'] = int(hallucinated)
cap_dict['metrics']['CHAIRi'] = 0.
cap_dict['metrics']['Recall'] = 0.
if len(words) > 0:
cap_dict['metrics']['CHAIRi'] = len(cap_dict['mscoco_hallucinated_words'])/float(len(words))
# add
if len(gt_objects) > 0:
cap_dict['metrics']['Recall'] = len(recall_gt_objects) / len(gt_objects)
output['sentences'].append(cap_dict)
chair_s = (num_hallucinated_caps/num_caps)
chair_i = (hallucinated_word_count/coco_word_count)
# add
recall = num_recall_gt_objects / num_gt_objects
output['overall_metrics'] = {'CHAIRs': chair_s,
'CHAIRi': chair_i,
'Recall': recall}
return output
def load_generated_captions(cap_file, image_id_key:str, caption_key:str):
#Read in captions
# it should be list of dict
ext = os.path.splitext(cap_file)[-1]
if ext == '.json':
caps = json.load(open(cap_file))
elif ext == '.jsonl':
caps = [json.loads(s) for s in open(cap_file)]
else:
raise ValueError(f'Unspported extension {ext} for cap_file: {cap_file}')
# list of int
imids = [obj[image_id_key] for obj in caps]
# list of str
caps = [obj[caption_key] for obj in caps]
return caps, imids
def save_hallucinated_words(cap_file, cap_dict):
with open(cap_file, 'w') as f:
json.dump(cap_dict, f, indent=2, ensure_ascii=False)
def print_metrics(hallucination_cap_dict, quiet=False):
sentence_metrics = hallucination_cap_dict['overall_metrics']
for k, v in sentence_metrics.items():
k_str = str(k).ljust(10)
v_str = f'{v * 100:.01f}'
print(k_str, v_str, sep=': ')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--cap_file", type=str, default='',
help="path towards json or jsonl saving image ids and their captions in list of dict.")
parser.add_argument("--image_id_key", type=str, default="image_id",
help="in each dict of cap_file, which key stores image id of coco.")
parser.add_argument("--caption_key", type=str, default="caption",
help="in each dict of cap_file, which key stores caption of the image.")
parser.add_argument("--cache", type=str, default="chair.pkl",
help="pre inited CHAIR evaluator object, for fast loading.")
parser.add_argument("--coco_path", type=str, default='coco_annotations',
help="only use for regenerating CHAIR evaluator object, will be ignored if uses cached evaluator.")
parser.add_argument("--save_path", type=str, default="",
help="saving CHAIR evaluate and results to json, useful for debugging the caption model.")
args = parser.parse_args()
if args.cache and os.path.exists(args.cache):
evaluator = pickle.load(open(args.cache, 'rb'))
print(f"loaded evaluator from cache: {args.cache}")
else:
print(f"cache not setted or not exist yet, building from scratch...")
evaluator = CHAIR(args.coco_path)
pickle.dump(evaluator, open(args.cache, 'wb'))
print(f"cached evaluator to: {args.cache}")
cap_dict = evaluator.compute_chair(args.cap_file, args.image_id_key, args.caption_key)
print_metrics(cap_dict)
if args.save_path:
save_hallucinated_words(args.save_path, cap_dict)
My specific questions:

- **Logic Check:** Is the `align_and_evaluate` logic flawed in a way that ignores partial matches or synonyms, leading to the low hallucination count?
- **Attention Validity:** Is summing the attention weights over the visual tokens (specifically across layers 10–25) the standard proxy for “visual grounding” in LLaVA?

Any insight into why the hallucination capture rate is so low, or whether the attention extraction logic misses the mark, would be greatly appreciated.