Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; IMAGE;
DOI
10.1109/TCSVT.2023.3257193
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808; 0809;
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing efforts learn a common semantic embedding space in which items of different modalities can be compared directly, pulling the global representations of positive video-text pairs close while pushing negative ones apart via a pair-wise ranking loss. However, such a vanilla loss unfortunately yields ambiguous feature embeddings for the texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance video-text retrieval, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate the mixed boundaries. Specifically, the teacher model is tailored toward a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, into a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD achieves up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance under the R@1 metric, compared with 14 state-of-the-art baselines.
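The abstract describes two training signals: a multimodal contrastive objective that maximizes mutual information between co-occurring video-text pairs (the teacher), and a distillation term that transfers the teacher's inter-instance structure to a student. The sketch below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes CLIP-style global embeddings, uses a symmetric InfoNCE loss as the contrastive objective, and uses a temperature-softened KL term over cross-modal similarity matrices as a stand-in for the relational distillation. All function names, tensor shapes, and temperature values are illustrative assumptions.

```python
# Minimal sketch of (1) a symmetric InfoNCE contrastive loss over co-occurring
# video/text embeddings and (2) a relational distillation loss that matches the
# student's inter-instance similarity distribution to a frozen teacher's.
import torch
import torch.nn.functional as F


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) global representations of co-occurring pairs.
    Matching pairs lie on the diagonal of the similarity matrix; the remaining
    entries in each row/column act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                          # (B, B) scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # average the video-to-text and text-to-video directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


def relation_distill(student_sim: torch.Tensor, teacher_sim: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """KL divergence between teacher and student similarity distributions.

    student_sim, teacher_sim: (B, B) cross-modal similarity matrices for the same
    batch. Softening with a temperature and matching row-wise distributions
    transfers the teacher's inter-instance structure rather than hard labels.
    """
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau


if __name__ == "__main__":
    # Illustrative usage: random features stand in for encoder outputs.
    B, D = 8, 512
    v_s, t_s = torch.randn(B, D), torch.randn(B, D)      # student embeddings
    with torch.no_grad():
        v_t, t_t = torch.randn(B, D), torch.randn(B, D)  # frozen teacher embeddings
    sim_s = F.normalize(v_s, dim=-1) @ F.normalize(t_s, dim=-1).T
    sim_t = F.normalize(v_t, dim=-1) @ F.normalize(t_t, dim=-1).T
    loss = info_nce(v_s, t_s) + relation_distill(sim_s, sim_t)
    print(f"total loss: {loss.item():.4f}")
```

In a full training loop these two terms would be combined with the student's own retrieval loss (e.g., the explicit discrimination loss the abstract mentions); the random tensors above merely stand in for video- and text-encoder outputs.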
Pages: 5486-5497
Number of pages: 12
Related papers
50 records in total
  • [31] Spatiotemporal contrastive modeling for video moment retrieval
    Wang, Yi
    Li, Kun
    Chen, Guoliang
    Zhang, Yan
    Guo, Dan
    Wang, Meng
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2023, 26 (04): 1525 - 1544
  • [32] TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio
    Wang, Xin
    Meng, Benyuan
    Chen, Hong
    Meng, Yuan
    Lv, Ke
    Zhu, Wenwu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 2391 - 2399
  • [34] Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
    Luo, Kaiyi
    Zhang, Xulong
    Wang, Jianzong
    Li, Huaxiong
    Cheng, Ning
    Xiao, Jing
    2023 IEEE 35TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2023, : 913 - 917
  • [35] Multimodal Knowledge Distillation in Spectral Imaging
    Lopes, Tomas
    Capela, Diana
    Ferreira, Miguel F. S.
    Teixeira, Joana
    Silva, Catarina
    Guimaraes, Diana F.
    Jorge, Pedro A. S.
    Silva, Nuno A.
    OPTICAL SENSING AND DETECTION VIII, 2024, 12999
  • [36] Enhanced Text Classification using Proxy Labels and Knowledge Distillation
    Sukumaran, Rohan
    Prabhu, Sumanth
    Misra, Hemant
    PROCEEDINGS OF THE 5TH JOINT INTERNATIONAL CONFERENCE ON DATA SCIENCE & MANAGEMENT OF DATA, CODS COMAD 2022, 2022, : 227 - 230
  • [37] Identity-Aware Contrastive Knowledge Distillation for Facial Attribute Recognition
    Chen, Si
    Zhu, Xueyan
    Yan, Yan
    Zhu, Shunzhi
    Li, Shao-Zi
    Wang, Da-Han
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5692 - 5706
  • [38] Domain Knowledge Distillation and Supervised Contrastive Learning for Industrial Process Monitoring
    Ai, Mingxi
    Xie, Yongfang
    Ding, Steven X. X.
    Tang, Zhaohui
    Gui, Weihua
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2023, 70 (09) : 9452 - 9462
  • [39] Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
    Wang, Jinpeng
    Chen, Bin
    Liao, Dongliang
    Zeng, Ziyun
    Li, Gongfu
    Xia, Shu-Tao
    Xu, Jin
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 3020 - 3030
  • [40] Categorical Relation-Preserving Contrastive Knowledge Distillation for Medical Image Classification
    Xing, Xiaohan
    Hou, Yuenan
    Li, Hang
    Yuan, Yixuan
    Li, Hongsheng
    Meng, Max Q-H
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT V, 2021, 12905 : 163 - 173