M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization

Cited by: 16
Authors
Liu, Che [1 ,2 ]
Cheng, Sibo [2 ,3 ]
Chen, Chen [3 ,5 ]
Qiao, Mengyun [2 ,4 ]
Zhang, Weitong [3 ]
Shah, Anand [6 ,7 ]
Bai, Wenjia [2 ,3 ,4 ]
Arcucci, Rossella [1 ,2 ]
Affiliations
[1] Imperial Coll London, Dept Earth Sci & Engn, London, England
[2] Imperial Coll London, Data Sci Inst, London, England
[3] Imperial Coll London, Dept Comp, London, England
[4] Imperial Coll London, Dept Brain Sci, London, England
[5] Univ Oxford, Dept Engn Sci, Oxford, England
[6] Imperial Coll London, Dept Infect Dis Epidemiol, London, England
[7] Royal Brompton & Harefield Hosp, London, England
Funding
Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Vision-language model; Vision-language pre-training; Self-supervised learning;
DOI
10.1007/978-3-031-43907-0_61
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Medical vision-language models enable co-learning and integration of features from medical imaging and clinical text. However, these models are not easy to train, and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency, and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100% of the data. The code can be found at https://github.com/cheliu-computation/M-FLAG-MICCAI2023.
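To make the two ideas named in the abstract concrete, the sketch below shows (i) how a language model can be kept frozen during pre-training and (ii) one common way to express an orthogonality penalty on latent features, namely pushing the Gram matrix of normalized latents towards the identity. This is a minimal illustration under assumptions: the exact form of M-FLAG's orthogonality loss and its training loop are defined in the paper and the linked repository, and the names used here (orthogonality_loss, text_encoder) are hypothetical.

    import torch
    import torch.nn.functional as F

    def orthogonality_loss(z: torch.Tensor) -> torch.Tensor:
        # Illustrative orthogonality penalty on a batch of latent vectors z of
        # shape (batch, dim): normalize, form the Gram matrix of pairwise
        # cosine similarities, and penalize its deviation from the identity so
        # that latent vectors decorrelate (spread out) in the latent space.
        z = F.normalize(z, dim=-1)
        gram = z @ z.t()
        identity = torch.eye(z.size(0), device=z.device)
        return ((gram - identity) ** 2).mean()

    # Freezing the language model (hypothetical text_encoder module): with
    # gradients disabled, only the vision encoder and projection heads are
    # updated during pre-training, which is what yields the training
    # stability/efficiency and the parameter reduction noted in the abstract.
    # text_encoder.requires_grad_(False)
    # text_encoder.eval()

    if __name__ == "__main__":
        latents = torch.randn(8, 128)  # e.g. 8 image latents of dimension 128
        print(orthogonality_loss(latents).item())

The Gram-matrix form above is only one standard way to write such a penalty; the paper's loss may operate on feature dimensions or use a different normalization.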
Pages: 637-647
Page count: 11