MKGF: A multi-modal knowledge graph based RAG framework to enhance LVLMs for Medical visual question answering

Times Cited: 0
Authors
Wu, Yinan [1 ]
Lu, Yuming [1 ]
Zhou, Yan [1 ]
Ding, Yifan [2 ]
Liu, Jingping [1 ]
Ruan, Tong [1 ]
Affiliations
[1] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai 200237, Peoples R China
[2] Fudan Univ, Zhongshan Hosp, Dept Crit Care Med, Shanghai 200032, Peoples R China
Keywords
Multi-modal; Knowledge graph; Large language model; Retrieval
DOI
10.1016/j.neucom.2025.129999
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Medical visual question answering (MedVQA) is a challenging task that requires models to understand medical images and return accurate responses to the given questions. Most recent methods focus on transferring general-domain large vision-language models (LVLMs) to the medical domain by constructing medical instruction datasets and applying in-context learning. However, the performance of these methods is limited by the hallucination issue of LVLMs. In addition, fine-tuning the abundant parameters of LVLMs on medical instruction datasets incurs high time and economic costs. Hence, we propose the MKGF framework, which leverages a multi-modal medical knowledge graph (MMKG) to relieve the hallucination issue without fine-tuning the abundant parameters of LVLMs. First, we employ a pre-trained text retriever to build question-knowledge relations on the training set. Second, we train a multi-modal retriever with these relations. Finally, we use it to retrieve question-relevant knowledge and enhance the performance of LVLMs on the test set. To evaluate the effectiveness of MKGF, we conduct extensive experiments on two public datasets, Slake and VQA-RAD. Our method improves the pre-trained SOTA LVLMs by 10.15% and 9.32%, respectively. The source code is available at https://github.com/ehnal/MKGF.
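The retrieve-then-prompt pipeline summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual implementation: the function names, toy embeddings, and prompt format are all assumptions, standing in for the paper's trained multi-modal retriever and MMKG.

```python
# Hypothetical sketch of a knowledge-graph RAG step for MedVQA: score each
# MMKG entry against the question embedding, keep the top-k, and prepend the
# retrieved facts to the LVLM prompt. All names and vectors are illustrative.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_topk(question_vec, kg_entries, k=2):
    """kg_entries: list of (text, embedding) pairs drawn from the knowledge graph."""
    scored = sorted(kg_entries, key=lambda e: cosine(question_vec, e[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def build_prompt(question, knowledge):
    """Prepend retrieved facts to the question before querying the LVLM."""
    facts = "\n".join(f"- {t}" for t in knowledge)
    return f"Relevant medical knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"

# Toy knowledge graph with 3-dimensional stand-in embeddings.
kg = [
    ("The liver is located in the right upper abdomen.", [0.9, 0.1, 0.0]),
    ("Pneumonia appears as lung opacities on chest X-rays.", [0.1, 0.9, 0.2]),
    ("MRI uses magnetic fields rather than ionizing radiation.", [0.0, 0.2, 0.9]),
]
q_vec = [0.85, 0.15, 0.05]  # toy embedding for a liver-related question
prompt = build_prompt("Where is the liver in this image?",
                      retrieve_topk(q_vec, kg, k=1))
print(prompt)
```

In the paper's setting the question embedding would come from the trained multi-modal retriever (which also sees the image), and the augmented prompt would be passed to a frozen LVLM, avoiding any fine-tuning of its parameters.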
Pages: 10