Multi-modal Adapter for Medical Vision-and-Language Learning

Cited by: 1
Authors
Yu, Zheng [1 ]
Qiao, Yanyuan [1 ]
Xie, Yutong [1 ]
Wu, Qi [1 ]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
Source
MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I | 2024 / Vol. 14348
Keywords
Medical Vision-and-Language Learning; Parameter-Efficient Transfer Learning; Multi-Modal Adapter; MODEL
DOI
10.1007/978-3-031-45673-2_39
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, medical vision-and-language learning has attracted great attention from the biomedical community. Thanks to the development of large pre-trained models, performance on medical multi-modal learning benchmarks has improved substantially. However, because model sizes are growing rapidly, fully fine-tuning these large pre-trained models has become costly: a huge set of parameters must be trained and stored for each downstream task. We therefore propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M3AD) to mitigate this problem. We select the state-of-the-art M3AE model as our baseline, which is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. Specifically, we first insert general adapters after the multi-head attention layers and feed-forward layers in all transformer blocks of M3AE. We then design a modality-fusion adapter that adopts multi-head attention mechanisms, and we insert it into the cross-modal encoder to enhance multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters of M3AE and train only the inserted adapters, which are much smaller. Extensive experimental results on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of the proposed method: M3AD achieves performance competitive with full fine-tuning while requiring far fewer trainable parameters and much less memory.
Pages: 393-402
Page count: 10
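As a supplementary illustration of the adapter design described in the abstract, the following is a minimal PyTorch sketch of a bottleneck adapter (inserted after frozen attention and feed-forward sub-layers) and an attention-based modality-fusion adapter, with only the adapters left trainable. Class names, bottleneck dimensions, and the exact placement and wiring are assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection.
    Inserted after a frozen attention or feed-forward sub-layer; only the
    adapter's own parameters are trained."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ModalityFusionAdapter(nn.Module):
    """Adapter with multi-head cross-attention: tokens of one modality attend
    to the other modality before a bottleneck projection (hypothetical design
    sketched from the abstract's description of the modality-fusion adapter)."""
    def __init__(self, hidden_dim: int, num_heads: int = 8, bottleneck_dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.adapter = BottleneckAdapter(hidden_dim, bottleneck_dim)

    def forward(self, x: torch.Tensor, other_modality: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query=x, key=other_modality, value=other_modality)
        return self.adapter(x + fused)


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze every parameter whose name does not contain 'adapter',
    mirroring the parameter-efficient training setup."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name


if __name__ == "__main__":
    # Hypothetical token sequences (batch, sequence length, hidden size).
    img_tokens = torch.randn(2, 197, 768)
    txt_tokens = torch.randn(2, 40, 768)
    fusion = ModalityFusionAdapter(hidden_dim=768)
    print(fusion(img_tokens, txt_tokens).shape)  # torch.Size([2, 197, 768])
```

In this sketch, the frozen backbone's outputs would pass through the bottleneck adapters after each attention and feed-forward sub-layer, while the fusion adapter sits in the cross-modal encoder; only the adapter weights receive gradients.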