Dynamic Contrastive Distillation for Image-Text Retrieval

Cited by: 21
Authors
Rao, Jun [1]
Ding, Liang [4]
Qi, Shuhan [2,3]
Fang, Meng [5]
Liu, Yang [1]
Shen, Li [4]
Tao, Dacheng [4]
Affiliations
[1] Harbin Institute of Technology, Shenzhen 518055, China
[2] Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory, Shenzhen 518055, China
[3] Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen 518055, China
[4] JD Explore Academy, JD.com, Beijing 101111, China
[5] University of Liverpool, Liverpool L69 3BX, England
Keywords
Cross-modal retrieval; neural networks; contrastive learning; robust
DOI
10.1109/TMM.2023.3236837
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The recent advancement of vision-and-language pretraining (VLP) has significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models makes them hard to deploy in practice: their high inference latency is unsuitable for real-world search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because optimizing over a large number of negative samples while handling cross-modal fusion features exceeds the available GPU memory; 2) statically optimizing the student network on hard samples is inefficient, which hurts both distillation learning and student network optimization. We propose a multi-modal contrastive learning method that balances training cost against effectiveness. Our approach uses a teacher network to identify hard samples for the student network, allowing the student both to leverage the knowledge of the pre-trained teacher and to learn effectively from hard samples. To learn from hard sample pairs, we further propose dynamic distillation, which adaptively weights samples of different difficulty so as to better balance the difficulty of the transferred knowledge against the student's own learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.
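To make the high-level description concrete, the following is a minimal PyTorch sketch of one way a contrastive distillation loss can combine teacher-mined hard negatives with a difficulty-dependent sample weight. The function name, the top-k mining rule, and the (1 - teacher positive probability) weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dcd_loss_sketch(student_img, student_txt, teacher_img, teacher_txt,
                    temperature=0.07, top_k=8):
    """Illustrative contrastive-distillation loss (an assumed scheme, not
    the paper's exact method): the teacher's similarity matrix mines hard
    negatives and weights each anchor by its difficulty."""
    # Cosine-similarity logits for student and teacher, shape (batch, batch).
    s_logits = (F.normalize(student_img, dim=-1)
                @ F.normalize(student_txt, dim=-1).t()) / temperature
    t_logits = (F.normalize(teacher_img, dim=-1)
                @ F.normalize(teacher_txt, dim=-1).t()) / temperature

    batch = s_logits.size(0)
    labels = torch.arange(batch, device=s_logits.device)
    diag = torch.eye(batch, dtype=torch.bool, device=s_logits.device)

    # Teacher-mined hard negatives: for each image, the top-k most-similar
    # non-matching captions according to the teacher.
    hard_idx = (t_logits.masked_fill(diag, float('-inf'))
                        .topk(k=min(top_k, batch - 1), dim=1).indices)

    # Restrict the student's softmax to {positive} ∪ {hard negatives};
    # this bounds memory instead of contrasting against the full batch.
    cols = torch.cat([labels.unsqueeze(1), hard_idx], dim=1)
    s_sub = s_logits.gather(1, cols)
    target = torch.zeros(batch, dtype=torch.long, device=s_logits.device)

    # Dynamic per-sample weight (assumed scheme): anchors the teacher finds
    # harder, i.e. with lower positive-pair probability, get larger weight,
    # so learning focuses on hard pairs.
    with torch.no_grad():
        weight = 1.0 - t_logits.softmax(dim=1).diagonal()

    # Hard-negative contrastive term plus a KL distillation term that
    # matches the student's restricted distribution to the teacher's.
    ce = F.cross_entropy(s_sub, target, reduction='none')
    kd = F.kl_div(s_sub.log_softmax(dim=1),
                  t_logits.gather(1, cols).softmax(dim=1),
                  reduction='none').sum(dim=1)
    return (weight * (ce + kd)).mean()
```

Restricting the contrast set to the positive plus a few teacher-mined negatives is one plausible answer to the GPU-memory challenge raised above, while the teacher-derived weight corresponds to the dynamic, difficulty-aware learning the abstract describes.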
Pages: 8383-8395
Page count: 13
References (74 in total)
[31] Kyaw, Zawlin; Qi, Shuhan; Gao, Ke; Zhang, Hanwang; Zhang, Luming; Xiao, Jun; Wang, Xuan; Chua, Tat-Seng. Matryoshka Peek: Toward Learning Fine-Grained, Robust, Discriminative Features for Product Search. IEEE Transactions on Multimedia, 2017, 19(6): 1272-1284.
[32] Lakshminarayanan, B. Advances in Neural Information Processing Systems, 2017, Vol. 30.
[33] Lee, Kuang-Huei; Chen, Xi; Hua, Gang; Hu, Houdong; He, Xiaodong. Stacked Cross Attention for Image-Text Matching. Computer Vision - ECCV 2018, Part IV, 2018, LNCS 11208: 212-228.
[34] Li, B. Y. AAAI Conference on Artificial Intelligence, 2019: 8577.
[35] Li, Chuan-Xiang; Yan, Ting-Kun; Luo, Xin; Nie, Liqiang; Xu, Xin-Shun. Supervised Robust Discrete Multimodal Hashing for Cross-Media Retrieval. IEEE Transactions on Multimedia, 2019, 21(11): 2863-2877.
[36] Li, Kunpeng; Zhang, Yulun; Li, Kai; Li, Yuanyuan; Fu, Yun. Visual Semantic Reasoning for Image-Text Matching. IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 4653-4661.
[37] Li, L. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021: 379.
[38] Li, W. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021, Vol. 1: 2592.
[39] Lin, Tsung-Yi; Goyal, Priya; Girshick, Ross; He, Kaiming; Dollar, Piotr. Focal Loss for Dense Object Detection. IEEE International Conference on Computer Vision (ICCV), 2017: 2999-3007.
[40] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollar, Piotr; Zitnick, C. Lawrence. Microsoft COCO: Common Objects in Context. Computer Vision - ECCV 2014, Part V, 2014, LNCS 8693: 740-755.