Woodpecker: hallucination correction for multimodal large language models

Cited by: 0
Authors
Yin, Shukang [1 ]
Fu, Chaoyou [2 ,3 ]
Zhao, Sirui [1 ]
Xu, Tong [1 ]
Wang, Hao [1 ]
Sui, Dianbo [4 ]
Shen, Yunhang [5 ]
Li, Ke [5 ]
Sun, Xing [5 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Artificial Intelligence & Data Sci, Hefei 230026, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou 215163, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[5] YouTu, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language;
DOI
10.1007/s11432-024-4251-x
CLC number
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method improves accuracy by 30.66%/24.33% over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
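The five-stage pipeline named in the abstract can be sketched as follows. All function bodies here are hypothetical toy stand-ins, shown only to illustrate the post-remedy control flow and the interpretable per-stage trace: the paper itself prompts an LLM for concept extraction, question formulation, and correction, and uses open-vocabulary detection plus VQA models for visual validation.

```python
from dataclasses import dataclass, field

@dataclass
class CorrectionTrace:
    """Intermediate outputs of every stage, kept for interpretability."""
    concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)
    claims: list = field(default_factory=list)
    corrected: str = ""

def woodpecker_sketch(answer: str, detections: set) -> CorrectionTrace:
    trace = CorrectionTrace()
    # Stage 1: key concept extraction (toy: match against a fixed
    # vocabulary; the paper prompts an LLM instead).
    vocab = {"dog", "cat", "frisbee", "bench", "tree"}
    trace.concepts = [w.strip(".,").lower() for w in answer.split()
                      if w.strip(".,").lower() in vocab]
    # Stage 2: question formulation, one verification query per concept.
    trace.questions = [f"Is there a {c} in the image?" for c in trace.concepts]
    # Stage 3: visual knowledge validation (toy: a precomputed detection
    # set stands in for open-vocabulary detection / VQA answers).
    trace.evidence = {c: c in detections for c in trace.concepts}
    # Stage 4: visual claim generation from the gathered evidence.
    trace.claims = [f"There is {'a' if ok else 'no'} {c} in the image."
                    for c, ok in trace.evidence.items()]
    # Stage 5: hallucination correction (toy string rewrite; the paper
    # prompts an LLM with the claims to rewrite the answer).
    trace.corrected = answer
    for c, ok in trace.evidence.items():
        if not ok:
            trace.corrected = trace.corrected.replace(f"a {c}", f"no {c}")
    return trace

trace = woodpecker_sketch("A dog is chasing a frisbee near a bench.",
                          detections={"dog", "frisbee"})
print(trace.corrected)  # "A dog is chasing a frisbee near no bench."
```

Because every stage writes into the trace, a user can inspect exactly which concept failed visual validation and why the answer was rewritten, which is the interpretability property the abstract highlights.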
Pages: 13