PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents

Cited by: 44
Authors
Lin, Weixiong [1]
Zhao, Ziheng [1]
Zhang, Xiaoman [1, 2]
Wu, Chaoyi [1, 2]
Zhang, Ya [1, 2]
Wang, Yanfeng [1, 2]
Xie, Weidi [1, 2]
Affiliations
[1] Shanghai Jiao Tong Univ, Cooperat Medianet Innovat Ctr, Shanghai, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
Source
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VIII | 2023 / Vol. 14227
Funding
National Key Research and Development Program of China;
Keywords
Multimodal Dataset; Vision-Language Pretraining;
DOI
10.1007/978-3-031-43993-3_51
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Foundation models trained on large-scale datasets have seen a recent surge in CV and NLP. In contrast, development in the biomedical domain lags far behind due to data scarcity. To address this issue, we build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMed Central's Open Access subset, which is 8 times larger than previous datasets. PMC-OA covers diverse modalities and diseases, with the majority of the image-caption samples aligned at a finer-grained level, i.e., subfigure and subcaption. When a CLIP-style model is pretrained on PMC-OA, the resulting model, named PMC-CLIP, outperforms previous state-of-the-art models on various downstream tasks, including image-text retrieval on ROCO, MedMNIST image classification, and medical VQA, for example, +8.1% R@10 on image-text retrieval and +3.9% accuracy on image classification.
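The abstract refers to a "CLIP-style" objective and to the R@10 retrieval metric without spelling either out. The following is a minimal sketch of what those terms usually mean, not the authors' implementation: a symmetric contrastive loss over an image-text similarity matrix, and Recall@K for image-to-text retrieval. All function names, the toy 2-D embeddings, and the temperature value of 0.07 are assumptions made for illustration; none of them come from the record above.

```python
import math


def cosine_similarity_matrix(image_embs, text_embs):
    """Return sim[i][j] = cosine similarity of image i and text j."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(sim, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def recall_at_k(sim, k):
    """Fraction of images whose paired text (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy, hand-made embeddings (hypothetical data): pair i is image i with text i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]

sim = cosine_similarity_matrix(image_embs, text_embs)
loss_matched = clip_style_loss(sim)

# Reversing the captions breaks the pairing for images 0 and 2.
sim_shuffled = cosine_similarity_matrix(image_embs, text_embs[::-1])
loss_shuffled = clip_style_loss(sim_shuffled)

print("loss (matched pairs):", loss_matched)
print("loss (shuffled pairs):", loss_shuffled)
print("Recall@1 (matched pairs):", recall_at_k(sim, 1))
```

In this sketch the loss is lower when each image is most similar to its own caption, which is the behavior contrastive pretraining optimizes for, and R@10 on a real benchmark would be `recall_at_k(sim, 10)` computed over the full test set.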
Pages: 525-536
Page count: 12