Dynamic Contrastive Distillation for Image-Text Retrieval

Cited by: 21
Authors
Rao, Jun [1]
Ding, Liang [4]
Qi, Shuhan [2,3]
Fang, Meng [5]
Liu, Yang [1]
Shen, Li [4]
Tao, Dacheng [4]
Affiliations
[1] Harbin Institute of Technology, Shenzhen 518055, China
[2] Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory, Shenzhen 518055, China
[3] Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen 518055, China
[4] JD Explore Academy, JD.com, Beijing 101111, China
[5] University of Liverpool, Liverpool L69 3BX, England
Keywords
Cross-modal retrieval; neural networks; contrastive learning
DOI
10.1109/TMM.2023.3236837
CLC classification
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
The recent advancement in vision-and-language pretraining (VLP) has significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models presents a challenge for real-world deployment due to their high latency, making them unsuitable for practical search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face the following two challenges: 1) typical uni-modal metric learning approaches are difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over large numbers of negative samples when handling cross-modal fusion features; and 2) statically optimizing the student network on hard samples of fixed difficulty is inefficient, which hampers both distillation learning and student network optimization. We propose a multi-modal contrastive learning method that balances training cost and effectiveness. Our approach uses a teacher network to identify hard samples for student networks to learn from, allowing the students to leverage the knowledge of pre-trained teachers and learn effectively from hard samples. To learn from hard sample pairs, we propose dynamic distillation, which dynamically adapts to samples of different difficulties to better balance the difficulty of the transferred knowledge against the student's self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.
Pages: 8383-8395
Number of pages: 13
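
The abstract outlines two core mechanisms: teacher-guided hard-negative mining for the contrastive loss, and per-sample dynamic weighting of the distillation signal. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation; the function name dcd_loss, the similarity-matrix interface, and the hyperparameters tau and k_hard are all illustrative assumptions.

```python
# Hypothetical sketch of the DCD idea described in the abstract: a frozen
# teacher scores image-text pairs, its hardest negatives are reused for the
# student's contrastive loss, and the distillation term is re-weighted per
# sample by how hard the teacher finds it. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def dcd_loss(student_sim, teacher_sim, tau=0.07, k_hard=8):
    """student_sim, teacher_sim: [B, B] image-to-text similarity matrices
    whose diagonal holds the matched pairs. Returns a combined contrastive +
    distillation loss over teacher-mined hard negatives (assumes B > k_hard)."""
    B = student_sim.size(0)

    # 1) Teacher-guided hard-negative mining: for each image, keep only the
    #    k negatives the teacher scores highest, so the student's contrastive
    #    softmax is over 1 + k candidates instead of B (GPU-memory friendly).
    diag = torch.eye(B, dtype=torch.bool, device=student_sim.device)
    teacher_negs = teacher_sim.masked_fill(diag, float("-inf"))
    hard_idx = teacher_negs.topk(k_hard, dim=1).indices           # [B, k_hard]

    pos = student_sim.diagonal().unsqueeze(1)                     # [B, 1]
    hard = student_sim.gather(1, hard_idx)                        # [B, k_hard]
    logits = torch.cat([pos, hard], dim=1) / tau                  # [B, 1 + k]
    targets = torch.zeros(B, dtype=torch.long, device=logits.device)
    contrastive = F.cross_entropy(logits, targets, reduction="none")

    # 2) Distillation: match the student's distribution over the same
    #    candidate set to the teacher's soft targets.
    t_logits = torch.cat(
        [teacher_sim.diagonal().unsqueeze(1), teacher_sim.gather(1, hard_idx)],
        dim=1) / tau
    kd = F.kl_div(F.log_softmax(logits, dim=1),
                  F.softmax(t_logits, dim=1),
                  reduction="none").sum(dim=1)

    # 3) Dynamic weighting: pairs the teacher finds hard (small margin between
    #    the positive and the best negative) lean on the teacher's knowledge;
    #    easy pairs rely more on the student's own contrastive signal.
    margin = teacher_sim.diagonal() - teacher_negs.max(dim=1).values
    w = torch.sigmoid(-margin / tau).detach()                     # hard -> w near 1
    return ((1 - w) * contrastive + w * kd).mean()
```

Restricting the softmax to the k negatives the teacher scores highest keeps the cost roughly O(B·k) rather than O(B²) over fused features, which addresses the abstract's first challenge; the sigmoid weight w shifts hard pairs toward the teacher's soft targets and easy pairs toward the student's own contrastive signal, which is one plausible reading of the dynamic-distillation idea.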