Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing methods learn a common semantic embedding space in which items from different modalities can be compared directly: the global representations of positive video-text pairs are pulled together while negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for the texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To address this, we propose a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, called MCKD, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate the mixed boundaries in the embedding space. Specifically, the teacher model is tailored to a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information between co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled to a student model, with the help of an explicit discrimination loss, to improve matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD achieves improvements of up to 8.8%, 6.4%, 5.9%, and 5.3% in text-to-video retrieval at R@1, compared with 14 state-of-the-art baselines.
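To make the objective described in the abstract concrete, below is a minimal PyTorch-style sketch of the two ingredients it names: a symmetric InfoNCE-style contrastive loss that maximizes mutual information between co-occurring video-text pairs (the teacher), and an inter-instance similarity distillation loss combined with a discrimination loss for the student. This is an illustrative sketch under assumptions, not the authors' implementation; the function names, the temperature tau, and the weight lambda_kd are hypothetical.

import torch
import torch.nn.functional as F

def contrastive_teacher_loss(video_emb, text_emb, tau=0.07):
    # Symmetric InfoNCE: the i-th video and i-th text in the batch form a
    # positive pair; all other pairings in the batch act as negatives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau                      # (B, B) scaled cosine similarities
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def inter_instance_distillation_loss(student_sim, teacher_sim, tau=0.07):
    # Match the student's row-wise similarity distribution to the teacher's
    # (KL divergence), transferring the teacher's inter-instance structure.
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def student_loss(student_v, student_t, teacher_v, teacher_t,
                 tau=0.07, lambda_kd=1.0):
    # Discrimination loss on the student's own embeddings plus distillation
    # of the (frozen) teacher's pairwise similarity matrix.
    sv, st = F.normalize(student_v, dim=-1), F.normalize(student_t, dim=-1)
    with torch.no_grad():
        tv, tt = F.normalize(teacher_v, dim=-1), F.normalize(teacher_t, dim=-1)
        teacher_sim = tv @ tt.t()
    student_sim = sv @ st.t()
    discrimination = contrastive_teacher_loss(student_v, student_t, tau)
    distill = inter_instance_distillation_loss(student_sim, teacher_sim, tau)
    return discrimination + lambda_kd * distill

In this sketch the teacher's pairwise similarity matrix plays the role of the "robust and structural inter-instance knowledge"; the paper's actual losses, architectures, and hyperparameters may differ.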
Pages: 5486-5497
Number of pages: 12
Related Papers
50 records in total
  • [1] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
    Liu, Zhi
    Zhao, Fangyuan
    Zhang, Mengmeng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
  • [2] Joint embeddings with multimodal cues for video-text retrieval
    Mithun, Niluthpol C.
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
  • [3] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [4] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [5] MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
    Shu, Fangxun
    Chen, Biaolong
    Liao, Yue
    Wang, Jinqiao
    Liu, Si
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9962 - 9972
  • [6] SPSD: Similarity-preserving self-distillation for video-text retrieval
    Wang, Jiachen
    Hua, Yan
    Yang, Yingyun
    Kou, Hongwei
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (02)
  • [7] CLIP2TF: Multimodal video-text retrieval for adolescent education
    Sun, Xiaoning
    Fan, Tao
    Li, Hongxu
    Wang, Guozhong
    Ge, Peien
    Shang, Xiwu
    DISPLAYS, 2024, 84
  • [8] KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
    Zhuang, Xianwei
    Li, Hongxiang
    Cheng, Xuxin
    Zhu, Zhihong
    Xie, Yuxin
    Zou, Yuexian
    COMPUTER VISION - ECCV 2024, PT XXXIV, 2025, 15092 : 313 - 331
  • [9] Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1438 - 1453