Multi-modal Adapter for Medical Vision-and-Language Learning

Cited by: 1
Authors
Yu, Zheng [1 ]
Qiao, Yanyuan [1 ]
Xie, Yutong [1 ]
Wu, Qi [1 ]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
Source
MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I | 2024 / Vol. 14348
Keywords
Medical Vision-and-Language Learning; Parameter-Efficient Transfer Learning; Multi-Modal Adapter; MODEL
DOI
10.1007/978-3-031-45673-2_39
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, medical vision-and-language learning has attracted great attention from the biomedical community. Thanks to the development of large pre-trained models, performance on medical multi-modal learning benchmarks has improved substantially. However, because model sizes are growing rapidly, fully fine-tuning these large pre-trained models has become costly: a huge set of parameters must be trained and stored for each downstream task. We therefore propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M3AD) to mitigate this problem. We select the state-of-the-art M3AE model as our baseline, which is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. Specifically, we first insert general adapters after the multi-head attention layers and feed-forward layers in all transformer blocks of M3AE. We then design a modality-fusion adapter that adopts multi-head attention mechanisms, and we insert it into the cross-modal encoder to enhance multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters of M3AE and train only the inserted adapters, which are much smaller. Extensive experimental results on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of the proposed method: M3AD achieves performance competitive with full fine-tuning while requiring far fewer trainable parameters and much less memory.
Pages: 393-402
Page count: 10
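As a supplementary illustration of the adapter design described in the abstract, the following is a minimal PyTorch sketch of a bottleneck adapter (inserted after frozen attention and feed-forward sub-layers) and an attention-based modality-fusion adapter, with only the adapters left trainable. Class names, bottleneck dimensions, and the exact placement and wiring are assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection.
    Inserted after a frozen attention or feed-forward sub-layer; only the
    adapter's own parameters are trained."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ModalityFusionAdapter(nn.Module):
    """Adapter with multi-head cross-attention: tokens of one modality attend
    to the other modality before a bottleneck projection (hypothetical design
    sketched from the abstract's description of the modality-fusion adapter)."""
    def __init__(self, hidden_dim: int, num_heads: int = 8, bottleneck_dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.adapter = BottleneckAdapter(hidden_dim, bottleneck_dim)

    def forward(self, x: torch.Tensor, other_modality: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query=x, key=other_modality, value=other_modality)
        return self.adapter(x + fused)


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze every parameter whose name does not contain 'adapter',
    mirroring the parameter-efficient training setup."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name


if __name__ == "__main__":
    # Hypothetical token sequences (batch, sequence length, hidden size).
    img_tokens = torch.randn(2, 197, 768)
    txt_tokens = torch.randn(2, 40, 768)
    fusion = ModalityFusionAdapter(hidden_dim=768)
    print(fusion(img_tokens, txt_tokens).shape)  # torch.Size([2, 197, 768])
```

In this sketch, the frozen backbone's outputs would pass through the bottleneck adapters after each attention and feed-forward sub-layer, while the fusion adapter sits in the cross-modal encoder; only the adapter weights receive gradients.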