Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing efforts learn a common semantic embedding space in which items of different modalities can be directly compared: the global representations of positive video-text pairs are pulled close while negative ones are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, which adaptively uses the general knowledge of a self-supervised teacher model to calibrate the mixed decision boundaries. Specifically, the teacher model learns a robust (less ambiguous) joint visual-text semantic space by maximizing the mutual information between co-occurring modalities during multimodal contrastive learning. This robust, structured inter-instance knowledge is then distilled into a student model via an explicit discrimination loss to improve matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD improves text-to-video R@1 by up to 8.8%, 6.4%, 5.9%, and 5.3%, respectively, compared with 14 state-of-the-art baselines.
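The record does not include code, so the following is a minimal PyTorch sketch of the two ingredients the abstract describes: a symmetric InfoNCE-style contrastive objective (maximizing mutual information between co-occurring video and text embeddings, as in the teacher's multimodal contrastive learning) and an inter-instance distillation term that transfers the teacher's cross-modal similarity structure to the student. Function names, the KL-divergence formulation, and the temperature values are illustrative assumptions, not MCKD's exact losses.

```python
import torch
import torch.nn.functional as F

def infonce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of co-occurring video-text pairs.

    Maximizes a lower bound on the mutual information between the two
    modalities: each video's positive is its own caption, and all other
    captions in the batch serve as negatives (and vice versa).
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def inter_instance_distillation(teacher_v, teacher_t, student_v, student_t,
                                temperature=4.0):
    """One plausible distillation term (an assumption, not the paper's loss):
    KL divergence between the teacher's and student's softened cross-modal
    similarity distributions, so the student inherits the teacher's less
    ambiguous inter-instance structure."""
    t_sim = F.normalize(teacher_v, dim=-1) @ F.normalize(teacher_t, dim=-1).t()
    s_sim = F.normalize(student_v, dim=-1) @ F.normalize(student_t, dim=-1).t()
    t_prob = F.softmax(t_sim / temperature, dim=-1)
    s_logp = F.log_softmax(s_sim / temperature, dim=-1)
    # Standard T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

# Usage sketch with random embeddings standing in for encoder outputs:
B, D = 32, 512
v, t = torch.randn(B, D), torch.randn(B, D)
loss = infonce_loss(v, t) + inter_instance_distillation(v, t, v, t)
```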
Pages: 5486-5497
Page count: 12
Related Papers
50 items total
  • [41] Dual Encoding for Video Retrieval by Text
    Dong, Jianfeng
    Li, Xirong
    Xu, Chaoxi
    Yang, Xun
    Yang, Gang
    Wang, Xun
    Wang, Meng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (08) : 4065 - 4080
  • [42] Knowledge Distillation for Single Image Super-Resolution via Contrastive Learning
    Liu, Cencen
    Zhang, Dongyang
    Qin, Ke
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 1079 - 1083
  • [43] Text-Centric Multimodal Contrastive Learning for Sentiment Analysis
    Peng, Heng
    Gu, Xue
    Li, Jian
    Wang, Zhaodan
    Xu, Hao
    ELECTRONICS, 2024, 13 (06)
  • [44] MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval
    Feng, Duoduo
    He, Xiangteng
    Peng, Yuxin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
  • [45] Multimodal Learning with Incomplete Modalities by Knowledge Distillation
    Wang, Qi
    Zhan, Liang
    Thompson, Paul
    Zhou, Jiayu
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 1828 - 1838
  • [46] Multi-views contrastive learning for dense text retrieval
    Yu, Yang
    Zeng, Jun
    Zhong, Lin
    Gao, Min
    Wen, Junhao
    Wu, Yingbo
    KNOWLEDGE-BASED SYSTEMS, 2023, 274
  • [47] AdaCLIP: Towards Pragmatic Multimodal Video Retrieval
    Hu, Zhiming
    Ye, Angela Ning
    Khorasgani, Salar Hosseini
    Mohomed, Iqbal
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5623 - 5633
  • [48] Video Summarization Using Knowledge Distillation-Based Attentive Network
    Qin, Jialin
    Yu, Hui
    Liang, Wei
    Ding, Derui
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1022 - 1031
  • [49] Knowledge Distillation Hashing for Occluded Face Retrieval
    Yang, Yuxiang
    Tian, Xing
    Ng, Wing W. Y.
    Gao, Ying
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9096 - 9107
  • [50] Knowledge Distillation and Contrastive Learning for Detecting Visible-Infrared Transmission Lines Using Separated Stagger Registration Network
    Zhou, Wujie
    Wang, Yusen
    Qian, Xiaohong
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2025