Woodpecker: hallucination correction for multimodal large language models

Cited by: 0
Authors
Yin, Shukang [1 ]
Fu, Chaoyou [2 ,3 ]
Zhao, Sirui [1 ]
Xu, Tong [1 ]
Wang, Hao [1 ]
Sui, Dianbo [4 ]
Shen, Yunhang [5 ]
Li, Ke [5 ]
Sun, Xing [5 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Artificial Intelligence & Data Sci, Hefei 230026, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou 215163, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[5] Tencent YouTu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language;
DOI
10.1007/s11432-024-4251-x
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we take a different route and introduce a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the great potential of this new paradigm. On the POPE benchmark, our method improves accuracy over the baseline MiniGPT-4/mPLUG-Owl by 30.66%/24.33%. The source code is released at https://github.com/BradyFU/Woodpecker.
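For a concrete picture of the five stages named in the abstract, below is a minimal Python sketch. It is an illustration only, not the authors' implementation: the function woodpecker_correct, its prompts, and the llm/detector callables are hypothetical stand-ins for whatever language model and open-vocabulary grounding module one plugs in, and the released code at https://github.com/BradyFU/Woodpecker differs in detail.

from typing import Callable, Dict, List

def woodpecker_correct(
    image_path: str,
    generated_text: str,
    llm: Callable[[str], str],                  # assumed: text-in, text-out LLM call
    detector: Callable[[str, List[str]], Dict[str, list]],  # assumed: open-vocabulary detector
) -> str:
    # Stage 1: key concept extraction -- pull the main objects mentioned
    # in the MLLM's answer (e.g. "dog", "frisbee").
    raw = llm(
        "List the key objects mentioned in this sentence, comma-separated:\n"
        + generated_text
    )
    concepts = [c.strip() for c in raw.split(",") if c.strip()]

    # Stage 2: question formulation -- targeted questions about each concept
    # (existence, count, attributes) to probe the original answer.
    questions = [
        llm(f"Write a simple verification question about '{c}' in an image.")
        for c in concepts
    ]

    # Stage 3: visual knowledge validation -- ground the concepts in the
    # image; here the detector returns bounding boxes per concept,
    # e.g. {"dog": [box1], "frisbee": []}.
    evidence = detector(image_path, concepts)

    # Stage 4: visual claim generation -- turn the grounded evidence into
    # plain-text claims the corrector can condition on.
    claims = "\n".join(
        f"There are {len(boxes)} {name}(s) in the image."
        for name, boxes in evidence.items()
    )

    # Stage 5: hallucination correction -- rewrite the original answer so it
    # agrees with the grounded claims; intermediate outputs above stay
    # inspectable, which is the interpretability the abstract mentions.
    return llm(
        "Rewrite the answer so it agrees with the visual facts.\n"
        f"Visual facts:\n{claims}\n"
        f"Questions considered:\n{chr(10).join(questions)}\n"
        f"Original answer:\n{generated_text}"
    )

Because the pipeline only consumes the MLLM's text output and the image, it can be bolted onto any MLLM after generation, which is what the abstract means by a post-remedy manner.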
Pages: 13
Related papers
50 records in total
  • [31] FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
    Zhang, Zhikai
    Li, Yitang
    Huang, Haofeng
    Lin, Mingxian
    Yi, Li
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081 : 403 - 421
  • [32] Reasoning-Driven Food Energy Estimation via Multimodal Large Language Models
    Tanabe, Hikaru
    Yanai, Keiji
    NUTRIENTS, 2025, 17 (07)
  • [33] Leveraging Multimodal Large Language Models for Enhanced Learning and Application in Building Energy Modeling
    Labib, Rania
    MULTIPHYSICS AND MULTISCALE BUILDING PHYSICS, IBPC 2024, VOL 3, 2025, 554 : 611 - 618
  • [34] Panel: Multimodal Large Foundation Models
    Kankanhalli, Mohan
    Worring, Marcel
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9709 - 9709
  • [35] Distilling implicit multimodal knowledge into large language models for zero-resource dialogue generation
    Zhang, Bo
    Ma, Hui
    Ding, Jian
    Wang, Jian
    Xu, Bo
    Lin, Hongfei
    INFORMATION FUSION, 2025, 118
  • [36] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
    Li, Yifan
    Guo, Hangyu
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    COMPUTER VISION - ECCV 2024, PT LXXIII, 2025, 15131 : 174 - 189
  • [37] Crack image classification and information extraction in steel bridges using multimodal large language models
    Wang, Xiao
    Yue, Qingrui
    Liu, Xiaogang
    AUTOMATION IN CONSTRUCTION, 2025, 171
  • [38] Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
    Harrison, Rachel M.
    Dereventsov, Anton
    Bibin, Anton
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1535 - 1542
  • [39] Multimodal learning using large language models to improve transient identification of nuclear power plants
    Qi, Ben
    Sun, Jun
    Sui, Zhe
    Xiao, Xingyu
    Liang, Jingang
    PROGRESS IN NUCLEAR ENERGY, 2024, 177
  • [40] MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding
    Liu, Yi
    Hou, Haowen
    Ma, Fei
    Ni, Shiguang
    Yu, Fei Richard
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 281 - 285