MCPL: Multi-Modal Collaborative Prompt Learning for Medical Vision-Language Model

Cited by: 3
Authors
Wang, Pengyu [1 ]
Zhang, Huaqi [2 ]
Yuan, Yixuan [1 ]
Affiliations
[1] Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China
[2] School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100091, China
Keywords
Task analysis; Collaboration; Adaptation models; Medical diagnostic imaging; Pipelines; Computational modeling; Pathology; Multi-modal prompting; prompt collaboration; vision-language model; medical reports and images
DOI
10.1109/TMI.2024.3418408
CLC number
TP39 [Computer Applications]
Discipline codes
081203; 0835
Abstract
Multi-modal prompt learning is a high-performance, cost-effective learning paradigm that learns both text and image prompts to tune pre-trained vision-language (V-L) models such as CLIP for multiple downstream tasks. However, recent methods typically treat text and image prompts as independent components, ignoring the dependency between them. Moreover, extending multi-modal prompt learning to the medical field is challenging because of the significant gap between general- and medical-domain data. To this end, we propose a Multi-modal Collaborative Prompt Learning (MCPL) pipeline that tunes a frozen V-L model to align medical text-image representations, thereby supporting medical downstream tasks. We first construct an anatomy-pathology (AP) prompt that joins the text and image prompts in multi-modal prompting. The AP prompt introduces instance-level anatomy and pathology information, helping the V-L model better comprehend medical reports and images. Next, we propose a graph-guided prompt collaboration module (GPCM), which explicitly establishes multi-way couplings between the AP, text, and image prompts, enabling the prompts to be produced and updated collaboratively for more effective prompting. Finally, we develop a novel prompt configuration scheme that attaches the AP prompt to the query and key, and the text/image prompt to the value, in self-attention layers, improving the interpretability of multi-modal prompts. Extensive experiments on numerous medical classification and object detection datasets show that the proposed pipeline achieves excellent effectiveness and generalization. Compared with state-of-the-art prompt learning methods, MCPL provides a more reliable multi-modal prompting paradigm that reduces the tuning cost of V-L models on medical downstream tasks. Code: https://github.com/CUHK-AIM-Group/MCPL.
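As a reading aid, here is a minimal PyTorch sketch of the two mechanisms named in the abstract, written from the abstract alone rather than from the released code (see the GitHub link above). The dense-attention coupling standing in for GPCM, the class names, and all shapes and hyperparameters are assumptions; only the prompt placement rule (AP prompt to query/key, text/image prompt to value) comes from the text.

import torch
import torch.nn as nn

class GraphPromptCollaboration(nn.Module):
    # Stand-in for GPCM: treat the AP, text, and image prompt sets as nodes
    # of a fully connected graph and let every prompt attend to all others,
    # so the three sets are produced and updated collaboratively.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.coupling = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ap, txt, img):
        nodes = torch.cat([ap, txt, img], dim=1)      # (B, P_ap+P_txt+P_img, D)
        msg, _ = self.coupling(nodes, nodes, nodes)   # multi-way coupling
        nodes = self.norm(nodes + msg)
        return nodes.split([ap.size(1), txt.size(1), img.size(1)], dim=1)

class PromptedSelfAttention(nn.Module):
    # Prompt configuration scheme from the abstract: the AP prompt is attached
    # to the query and key streams, the text/image prompt to the value stream.
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, ap_prompt, modal_prompt):
        # ap_prompt and modal_prompt must share length P so keys/values align.
        qk = torch.cat([ap_prompt, x], dim=1)         # AP prompt -> query, key
        v = torch.cat([modal_prompt, x], dim=1)       # text/image prompt -> value
        out, _ = self.attn(qk, qk, v)
        return out[:, ap_prompt.size(1):]             # keep original token positions

Note that attaching different prompts to key and value only works because both streams receive the same number of prompt tokens: attention requires keys and values of equal sequence length, which is why the AP and text/image prompts share a length P in this sketch.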
Pages: 4224-4235 (12 pages)