Dynamic Contrastive Distillation for Image-Text Retrieval

Cited by: 21
Authors
Rao, Jun [1 ]
Ding, Liang [4 ]
Qi, Shuhan [2 ,3 ]
Fang, Meng [5 ]
Liu, Yang [1 ]
Shen, Li [4 ]
Tao, Dacheng [4 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[2] Harbin Inst Technol Shenzhen, Peng Cheng Lab, Shenzhen 518055, Peoples R China
[3] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518055, Peoples R China
[4] JD Explore Acad JD Com, Beijing 101111, Peoples R China
[5] Univ Liverpool, Liverpool L69 3BX, England
Keywords
Cross-modal retrieval; neural networks; contrastive learning; robust
DOI
10.1109/TMM.2023.3236837
CLC number
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
The recent advancement in vision-and-language pretraining (VLP) has significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models makes real-world deployment difficult: their high inference latency is unsuitable for practical search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework that compresses large VLP models for the ITR task. Technically, we face two challenges: 1) typical uni-modal metric learning is difficult to apply directly to cross-modal tasks, because GPU memory is too limited to optimize over large numbers of negative samples when handling cross-modal fusion features; 2) statically optimizing the student network on samples of differing difficulty is inefficient and hampers both distillation and student optimization. We therefore propose a multi-modal contrastive learning method that balances training cost against effectiveness: a teacher network identifies hard samples for the student to learn from, allowing the student to exploit the knowledge of the pre-trained teacher while learning effectively from hard samples. To learn from hard sample pairs, we further propose dynamic distillation, which adaptively weights samples of different difficulty so as to better balance the difficulty of the transferred knowledge against the student's own learning ability. We successfully apply the proposed DCD strategy to two state-of-the-art vision-language pretrained models, ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.
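To make the two ideas in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of teacher-guided hard-negative mining combined with difficulty-weighted contrastive distillation. It assumes an in-batch InfoNCE-style setup over paired image/text embeddings; all specifics (the temperatures tau_s/tau_t, the top-k mining size, the 1 - p_pos weighting rule) are illustrative assumptions, not the paper's actual settings or formulation.

# Hypothetical sketch of the abstract's two ideas: (1) a frozen teacher
# scores image-text pairs and mines hard negatives for the student, and
# (2) a dynamic per-sample weight scales the loss by sample difficulty.
# All hyperparameters and the weighting scheme are illustrative guesses.
import torch
import torch.nn.functional as F

def teacher_mined_contrastive_distill(
    s_img, s_txt,          # student embeddings, shape (B, D)
    t_img, t_txt,          # frozen teacher embeddings, shape (B, D)
    tau_s=0.05, tau_t=0.05, k=8,   # assumes batch size B > k
):
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    # In-batch similarity matrices: row i scores image i against all captions.
    s_logits = s_img @ s_txt.t() / tau_s                 # (B, B)
    with torch.no_grad():
        t_logits = t_img @ t_txt.t() / tau_t             # (B, B), detached

    B = s_logits.size(0)
    labels = torch.arange(B, device=s_logits.device)

    # (1) Teacher-side hard-negative mining: for each image, keep its
    # positive caption plus the k negatives the teacher scores highest,
    # bounding the number of pairs the student must optimize over.
    diag = torch.eye(B, dtype=torch.bool, device=t_logits.device)
    neg_scores = t_logits.masked_fill(diag, -1e4)
    hard_idx = neg_scores.topk(k, dim=-1).indices        # (B, k)
    keep = torch.cat([labels.unsqueeze(-1), hard_idx], dim=-1)  # (B, k+1)

    s_sel = s_logits.gather(-1, keep)                    # student logits on kept pairs
    t_sel = t_logits.gather(-1, keep)                    # teacher logits on same pairs

    # (2) Dynamic weighting: the less confident the teacher is in the
    # positive, the harder the sample, the larger its weight -- one
    # plausible choice, not necessarily the paper's.
    with torch.no_grad():
        p_pos = t_sel.softmax(-1)[:, 0]                  # teacher prob. of the positive
        w = (1.0 - p_pos)
        w = w / w.mean().clamp_min(1e-8)                 # keep batch-mean weight at 1

    # Contrastive task loss (positive sits at column 0 after the gather).
    target = torch.zeros(B, dtype=torch.long, device=s_sel.device)
    task = F.cross_entropy(s_sel, target, reduction="none")
    # KL distillation toward the teacher's distribution over the same pairs.
    kd = F.kl_div(s_sel.log_softmax(-1), t_sel.softmax(-1),
                  reduction="none").sum(-1)
    return (w * (task + kd)).mean()

In practice s_img/s_txt would come from the compressed student encoders and t_img/t_txt from the frozen VLP teacher; restricting the loss to the teacher-mined top-k negatives is what keeps GPU memory bounded, which is the first challenge the abstract raises.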
Pages: 8383-8395
Page count: 13