DFF: InstructBLIP-based Explainable DeepFake Detection

📖 Model Description

This is the core DFF (DeepFake Detection and Forensic Explanation Framework) model described in the ACL 2026 paper "Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline".

DFF is built on the InstructBLIP (Flan-T5 XL) architecture. With the Face-ViT auxiliary classifier integrated, the paper reports state-of-the-art performance on both forgery localization (mask generation) and forensic explanation (captioning).

🌟 Key Capabilities

Forgery Localization: Generates high-resolution binary masks highlighting manipulated facial regions.
Natural Language Explanation: Produces detailed text describing why a specific image is considered a forgery (e.g., "The texture around the eyes is unnatural due to GAN-based blending").
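
As a rough illustration of how a localization output could be consumed downstream, the sketch below binarizes a per-pixel manipulation score map, measures how much of the image is flagged, and compares a predicted mask against a reference mask. It is not taken from the DFF code base: the function names, the nested-list mask format, and the 0.5 threshold are all assumptions made for this example, and the model's actual output format may differ.

```python
from typing import List

# Hypothetical mask representation: rows of 0/1 integers (assumption for this sketch).
Mask = List[List[int]]


def binarize(scores: List[List[float]], threshold: float = 0.5) -> Mask:
    """Turn per-pixel manipulation scores in [0, 1] into a 0/1 mask.

    The 0.5 default threshold is an illustrative choice, not a DFF setting.
    """
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def manipulated_fraction(mask: Mask) -> float:
    """Fraction of pixels flagged as manipulated (0.0 for an empty mask)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks.

    Two all-zero masks are treated as a perfect match (IoU = 1.0).
    """
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += p & t
            union += p | t
    return inter / union if union else 1.0
```

A caller would typically pair `manipulated_fraction` or `iou` with the generated explanation text when assembling a report; how the two outputs are combined in practice is left to the official code.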

🛠️ Model Details

Base LLM: Flan-T5 XL.
Visual Encoder: EVA-ViT-G.
Auxiliary Module: Face-ViT (Multi-label perception).
Task: Explainable Detection & Multi-modal Attribution Reporting.

🚀 Links

Official Code: Generating-Attribution-Reports
Auxiliary Classifier: LianJC/Face-ViT-MultiLabel
Dataset (MMTT): LianJC/MMTT-Dataset

📜 Citation

@inproceedings{lian2026generating,
  title={Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline},
  author={Lian, Jingchun and others},
  booktitle={Proceedings of ACL},
  year={2026},
  note={To appear}
}
