Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15

Authors:

Ma, Wentao [1]

Chen, Qingchao [2]

Zhou, Tongqing [1]

Zhao, Shan [3]

Cai, Zhiping [1]

Affiliations:

[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China

[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China

[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2023 / Vol. 33 / No. 10

Funding:

National Natural Science Foundation of China;

Keywords:

Cross-modal retrieval; contrastive learning; knowledge distillation; IMAGE;

DOI:

10.1109/TCSVT.2023.3257193

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Cross-modal retrieval aims to enable a flexible bi-directional retrieval experience across different modalities (e.g., searching for videos with texts). Many existing efforts learn a common semantic embedding space in which items of different modalities can be directly compared, wherein the global representations of positive video-text pairs are pulled close while those of negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrievals. To this end, we propose a multimodal contrastive knowledge distillation method for instance video-text retrieval, called MCKD, which adaptively uses the general knowledge of a self-supervised model (teacher) to calibrate mixed boundaries. Specifically, the teacher model is tailored for a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust and structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, to a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) demonstrate that our MCKD achieves up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance by the R@1 metric, compared with 14 SoTA baselines.


Pages: 5486 - 5497

Page count: 12

50 entries in total

[11] FeatInter: Exploring fine-grained object features for video-text retrieval
Liu, Baolong
Zheng, Qi
Wang, Yabing
Zhang, Minsong
Dong, Jianfeng
Wang, Xun
NEUROCOMPUTING, 2022, 496 : 178 - 191
[12] Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
Nabati, Masoomeh
Behrad, Alireza
EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
[13] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval
Wang, Wei
Gao, Junyu
Yang, Xiaoshan
Xu, Changsheng
IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2386 - 2397
[14] VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning
Wang, Yanan
Zeng, Donghuo
Wada, Shinya
Kurihara, Satoshi
IEEE ACCESS, 2023, 11 : 51229 - 51240
[15] SPSD: Similarity-preserving self-distillation for video–text retrieval
Jiachen Wang
Yan Hua
Yingyun Yang
Hongwei Kou
International Journal of Multimedia Information Retrieval, 2023, 12
[16] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
Nian, Fudong
Ding, Ling
Hu, Yuxia
Gu, Yanhong
MATHEMATICS, 2022, 10 (18)
[17] Video Moment Retrieval with Hierarchical Contrastive Learning
Zhang, Bolin
Yang, Chao
Jiang, Bin
Zhou, Xiaokang
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
[18] Video Corpus Moment Retrieval with Contrastive Learning
Zhang, Hao
Sun, Aixin
Jing, Wei
Nan, Guoshun
Zhen, Liangli
Zhou, Joey Tianyi
Goh, Rick Siow Mong
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 685 - 695
[19] CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
Zhuo, Yaoxin
Li, Yikang
Hsiao, Jenhao
Ho, Chiuman
Li, Baoxin
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 158 - 166
[20] Porn Streamer Recognition in Live Video Based on Multimodal Knowledge Distillation
Wang Liyuan
Zhang Jing
Yao Jiacheng
Zhuo Li
CHINESE JOURNAL OF ELECTRONICS, 2021, 30 (06) : 1096 - 1102
