Woodpecker: hallucination correction for multimodal large language models

Cited by: 0
Authors
Yin, Shukang [1 ]
Fu, Chaoyou [2 ,3 ]
Zhao, Sirui [1 ]
Xu, Tong [1 ]
Wang, Hao [1 ]
Sui, Dianbo [4 ]
Shen, Yunhang [5 ]
Li, Ke [5 ]
Sun, Xing [5 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Artificial Intelligence & Data Sci, Hefei 230026, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou 215163, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[5] Tencent YouTu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language;
DOI
10.1007/s11432-024-4251-x
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we take a different route and introduce a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the great potential of this new paradigm. On the POPE benchmark, our method improves accuracy over the baseline MiniGPT-4/mPLUG-Owl by 30.66%/24.33%. The source code is released at https://github.com/BradyFU/Woodpecker.
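For a concrete picture of the five stages named in the abstract, below is a minimal Python sketch. It is an illustration only, not the authors' implementation: the function woodpecker_correct, its prompts, and the llm/detector callables are hypothetical stand-ins for whatever language model and open-vocabulary grounding module one plugs in, and the released code at https://github.com/BradyFU/Woodpecker differs in detail.

from typing import Callable, Dict, List

def woodpecker_correct(
    image_path: str,
    generated_text: str,
    llm: Callable[[str], str],                  # assumed: text-in, text-out LLM call
    detector: Callable[[str, List[str]], Dict[str, list]],  # assumed: open-vocabulary detector
) -> str:
    # Stage 1: key concept extraction -- pull the main objects mentioned
    # in the MLLM's answer (e.g. "dog", "frisbee").
    raw = llm(
        "List the key objects mentioned in this sentence, comma-separated:\n"
        + generated_text
    )
    concepts = [c.strip() for c in raw.split(",") if c.strip()]

    # Stage 2: question formulation -- targeted questions about each concept
    # (existence, count, attributes) to probe the original answer.
    questions = [
        llm(f"Write a simple verification question about '{c}' in an image.")
        for c in concepts
    ]

    # Stage 3: visual knowledge validation -- ground the concepts in the
    # image; here the detector returns bounding boxes per concept,
    # e.g. {"dog": [box1], "frisbee": []}.
    evidence = detector(image_path, concepts)

    # Stage 4: visual claim generation -- turn the grounded evidence into
    # plain-text claims the corrector can condition on.
    claims = "\n".join(
        f"There are {len(boxes)} {name}(s) in the image."
        for name, boxes in evidence.items()
    )

    # Stage 5: hallucination correction -- rewrite the original answer so it
    # agrees with the grounded claims; intermediate outputs above stay
    # inspectable, which is the interpretability the abstract mentions.
    return llm(
        "Rewrite the answer so it agrees with the visual facts.\n"
        f"Visual facts:\n{claims}\n"
        f"Questions considered:\n{chr(10).join(questions)}\n"
        f"Original answer:\n{generated_text}"
    )

Because the pipeline only consumes the MLLM's text output and the image, it can be bolted onto any MLLM after generation, which is what the abstract means by a post-remedy manner.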
Pages: 13
Related papers
50 records in total
  • [31] FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
    Zhang, Zhikai
    Li, Yitang
    Huang, Haofeng
    Lin, Mingxian
    Yi, Li
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081 : 403 - 421
  • [32] Reasoning-Driven Food Energy Estimation via Multimodal Large Language Models
    Tanabe, Hikaru
    Yanai, Keiji
    NUTRIENTS, 2025, 17 (07)
  • [33] Leveraging Multimodal Large Language Models for Enhanced Learning and Application in Building Energy Modeling
    Labib, Rania
    MULTIPHYSICS AND MULTISCALE BUILDING PHYSICS, IBPC 2024, VOL 3, 2025, 554 : 611 - 618
  • [34] Panel: Multimodal Large Foundation Models
    Kankanhalli, Mohan
    Worring, Marcel
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 9709 - 9709
  • [35] Distilling implicit multimodal knowledge into large language models for zero-resource dialogue generation
    Zhang, Bo
    Ma, Hui
    Ding, Jian
    Wang, Jian
    Xu, Bo
    Lin, Hongfei
    INFORMATION FUSION, 2025, 118
  • [36] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
    Li, Yifan
    Guo, Hangyu
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    COMPUTER VISION - ECCV 2024, PT LXXIII, 2025, 15131 : 174 - 189
  • [37] Crack image classification and information extraction in steel bridges using multimodal large language models
    Wang, Xiao
    Yue, Qingrui
    Liu, Xiaogang
    AUTOMATION IN CONSTRUCTION, 2025, 171
  • [38] Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
    Harrison, Rachel M.
    Dereventsov, Anton
    Bibin, Anton
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1535 - 1542
  • [39] Multimodal learning using large language models to improve transient identification of nuclear power plants
    Qi, Ben
    Sun, Jun
    Sui, Zhe
    Xiao, Xingyu
    Liang, Jingang
    PROGRESS IN NUCLEAR ENERGY, 2024, 177
  • [40] MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding
    Liu, Yi
    Hou, Haowen
    Ma, Fei
    Ni, Shiguang
    Yu, Fei Richard
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 281 - 285