Woodpecker: hallucination correction for multimodal large language models

Cited by: 0
Authors
Yin, Shukang [1 ]
Fu, Chaoyou [2 ,3 ]
Zhao, Sirui [1 ]
Xu, Tong [1 ]
Wang, Hao [1 ]
Sui, Dianbo [4 ]
Shen, Yunhang [5 ]
Li, Ke [5 ]
Sun, Xing [5 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Artificial Intelligence & Data Sci, Hefei 230026, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou 215163, Peoples R China
[4] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[5] YouTu, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language;
DOI
10.1007/s11432-024-4251-x
CLC number
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method improves accuracy by 30.66%/24.33% over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
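The five-stage pipeline named in the abstract can be sketched as follows. All function bodies here are hypothetical toy stand-ins, shown only to illustrate the post-remedy control flow and the interpretable per-stage trace: the paper itself prompts an LLM for concept extraction, question formulation, and correction, and uses open-vocabulary detection plus VQA models for visual validation.

```python
from dataclasses import dataclass, field

@dataclass
class CorrectionTrace:
    """Intermediate outputs of every stage, kept for interpretability."""
    concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)
    claims: list = field(default_factory=list)
    corrected: str = ""

def woodpecker_sketch(answer: str, detections: set) -> CorrectionTrace:
    trace = CorrectionTrace()
    # Stage 1: key concept extraction (toy: match against a fixed
    # vocabulary; the paper prompts an LLM instead).
    vocab = {"dog", "cat", "frisbee", "bench", "tree"}
    trace.concepts = [w.strip(".,").lower() for w in answer.split()
                      if w.strip(".,").lower() in vocab]
    # Stage 2: question formulation, one verification query per concept.
    trace.questions = [f"Is there a {c} in the image?" for c in trace.concepts]
    # Stage 3: visual knowledge validation (toy: a precomputed detection
    # set stands in for open-vocabulary detection / VQA answers).
    trace.evidence = {c: c in detections for c in trace.concepts}
    # Stage 4: visual claim generation from the gathered evidence.
    trace.claims = [f"There is {'a' if ok else 'no'} {c} in the image."
                    for c, ok in trace.evidence.items()]
    # Stage 5: hallucination correction (toy string rewrite; the paper
    # prompts an LLM with the claims to rewrite the answer).
    trace.corrected = answer
    for c, ok in trace.evidence.items():
        if not ok:
            trace.corrected = trace.corrected.replace(f"a {c}", f"no {c}")
    return trace

trace = woodpecker_sketch("A dog is chasing a frisbee near a bench.",
                          detections={"dog", "frisbee"})
print(trace.corrected)  # "A dog is chasing a frisbee near no bench."
```

Because every stage writes into the trace, a user can inspect exactly which concept failed visual validation and why the answer was rewritten, which is the interpretability property the abstract highlights.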
Pages: 13