MKGF: A multi-modal knowledge graph based RAG framework to enhance LVLMs for Medical visual question answering

Times Cited: 0
Authors
Wu, Yinan [1 ]
Lu, Yuming [1 ]
Zhou, Yan [1 ]
Ding, Yifan [2 ]
Liu, Jingping [1 ]
Ruan, Tong [1 ]
Affiliations
[1] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai 200237, Peoples R China
[2] Fudan Univ, Zhongshan Hosp, Dept Crit Care Med, Shanghai 200032, Peoples R China
Keywords
Multi-modal; Knowledge graph; Large language model; Retrieval
DOI
10.1016/j.neucom.2025.129999
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Medical visual question answering (MedVQA) is a challenging task that requires models to understand medical images and return accurate responses to the given questions. Most recent methods focus on transferring general-domain large vision-language models (LVLMs) to the medical domain by constructing medical instruction datasets and applying in-context learning. However, the performance of these methods is limited by the hallucination issue of LVLMs. In addition, fine-tuning the abundant parameters of LVLMs on medical instruction datasets incurs high time and economic costs. Hence, we propose the MKGF framework, which leverages a multi-modal medical knowledge graph (MMKG) to relieve the hallucination issue without fine-tuning the abundant parameters of LVLMs. First, we employ a pre-trained text retriever to build question-knowledge relations on the training set. Second, we train a multi-modal retriever with these relations. Finally, we use it to retrieve question-relevant knowledge and enhance the performance of LVLMs on the test set. To evaluate the effectiveness of MKGF, we conduct extensive experiments on two public datasets, Slake and VQA-RAD. Our method improves the pre-trained SOTA LVLMs by 10.15% and 9.32%, respectively. The source code is available at https://github.com/ehnal/MKGF.
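The retrieve-then-prompt pipeline summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual implementation: the function names, toy embeddings, and prompt format are all assumptions, standing in for the paper's trained multi-modal retriever and MMKG.

```python
# Hypothetical sketch of a knowledge-graph RAG step for MedVQA: score each
# MMKG entry against the question embedding, keep the top-k, and prepend the
# retrieved facts to the LVLM prompt. All names and vectors are illustrative.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_topk(question_vec, kg_entries, k=2):
    """kg_entries: list of (text, embedding) pairs drawn from the knowledge graph."""
    scored = sorted(kg_entries, key=lambda e: cosine(question_vec, e[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def build_prompt(question, knowledge):
    """Prepend retrieved facts to the question before querying the LVLM."""
    facts = "\n".join(f"- {t}" for t in knowledge)
    return f"Relevant medical knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"

# Toy knowledge graph with 3-dimensional stand-in embeddings.
kg = [
    ("The liver is located in the right upper abdomen.", [0.9, 0.1, 0.0]),
    ("Pneumonia appears as lung opacities on chest X-rays.", [0.1, 0.9, 0.2]),
    ("MRI uses magnetic fields rather than ionizing radiation.", [0.0, 0.2, 0.9]),
]
q_vec = [0.85, 0.15, 0.05]  # toy embedding for a liver-related question
prompt = build_prompt("Where is the liver in this image?",
                      retrieve_topk(q_vec, kg, k=1))
print(prompt)
```

In the paper's setting the question embedding would come from the trained multi-modal retriever (which also sees the image), and the augmented prompt would be passed to a frozen LVLM, avoiding any fine-tuning of its parameters.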
Pages: 10