Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1]
Chen, Qingchao [2]
Zhou, Tongqing [1]
Zhao, Shan [3]
Cai, Zhiping [1]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; IMAGE;
DOI
10.1109/TCSVT.2023.3257193
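Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808; 0809;
Abstract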
Cross-modal retrieval aims to enable a flexible bi-directional retrieval experience across different modalities (e.g., searching for videos with texts). Many existing efforts learn a common semantic embedding space in which items of different modalities can be directly compared: the global representations of positive video-text pairs are pulled close while those of negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss unfortunately yields ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrievals. To this end, we propose a multimodal contrastive knowledge distillation method for instance video-text retrieval, called MCKD, which adaptively uses the general knowledge of a self-supervised model (teacher) to calibrate mixed boundaries. Specifically, the teacher model is tailored for a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust and structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, to a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) demonstrate that our MCKD can achieve up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance by the R@1 metric, compared with 14 SoTA baselines.
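The ingredients named in the abstract (a contrastive objective over co-occurring video-text pairs, distillation of inter-instance structure from teacher to student, and the R@1 metric) can be illustrated with a minimal pure-Python sketch. This is not the authors' MCKD implementation. It assumes a symmetric InfoNCE loss as the contrastive objective (a standard lower bound on mutual information) and a KL divergence between softened similarity rows as the distillation term; all function names and the temperature of 0.07 are hypothetical choices made for illustration.

```python
import math


def cosine_sim_matrix(videos, texts):
    """sim[i][j] = cosine similarity between video i and text j."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    V = [unit(v) for v in videos]
    T = [unit(t) for t in texts]
    return [[sum(a * b for a, b in zip(v, t)) for t in T] for v in V]


def softmax(row, temperature):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE; video i and text i form the positive pair (assumed)."""
    n = len(sim)
    sim_t = transpose(sim)
    v2t = -sum(math.log(softmax(sim[i], temperature)[i]) for i in range(n)) / n
    t2v = -sum(math.log(softmax(sim_t[i], temperature)[i]) for i in range(n)) / n
    return 0.5 * (v2t + t2v)


def distillation_kl(teacher_sim, student_sim, temperature=0.07):
    """Mean KL(teacher || student) over softened similarity rows.

    One common way to transfer inter-instance structure; an assumption here,
    not necessarily the loss used in the paper.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_sim)


def recall_at_1(sim):
    """Text-to-video R@1: fraction of texts whose top-ranked video is the paired one."""
    sim_t = transpose(sim)  # rows now index texts, columns index videos
    hits = 0
    for j, row in enumerate(sim_t):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == j
    return hits / len(sim_t)


# Toy embeddings (made up for illustration): two videos and their paired texts.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8]]
sim = cosine_sim_matrix(videos, texts)
```

With the toy embeddings above, each text is closest to its own video, so `recall_at_1(sim)` reports a perfect score; reversing the order of `texts` breaks the pairing and raises the contrastive loss.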
Pages: 5486-5497
Page count: 12
Related papers
50 in total
  • [11] FeatInter: Exploring fine-grained object features for video-text retrieval
    Liu, Baolong
    Zheng, Qi
    Wang, Yabing
    Zhang, Minsong
    Dong, Jianfeng
    Wang, Xun
    NEUROCOMPUTING, 2022, 496 : 178 - 191
  • [12] Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
    Nabati, Masoomeh
    Behrad, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [13] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval
    Wang, Wei
    Gao, Junyu
    Yang, Xiaoshan
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2386 - 2397
  • [14] VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning
    Wang, Yanan
    Zeng, Donghuo
    Wada, Shinya
    Kurihara, Satoshi
    IEEE ACCESS, 2023, 11 : 51229 - 51240
  • [15] SPSD: Similarity-preserving self-distillation for video–text retrieval
    Jiachen Wang
    Yan Hua
    Yingyun Yang
    Hongwei Kou
    International Journal of Multimedia Information Retrieval, 2023, 12
  • [16] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    MATHEMATICS, 2022, 10 (18)
  • [17] Video Moment Retrieval with Hierarchical Contrastive Learning
    Zhang, Bolin
    Yang, Chao
    Jiang, Bin
    Zhou, Xiaokang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [18] Video Corpus Moment Retrieval with Contrastive Learning
    Zhang, Hao
    Sun, Aixin
    Jing, Wei
    Nan, Guoshun
    Zhen, Liangli
    Zhou, Joey Tianyi
    Goh, Rick Siow Mong
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 685 - 695
  • [19] CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
    Zhuo, Yaoxin
    Li, Yikang
    Hsiao, Jenhao
    Ho, Chiuman
    Li, Baoxin
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 158 - 166
  • [20] Porn Streamer Recognition in Live Video Based on Multimodal Knowledge Distillation
    Wang Liyuan
    Zhang Jing
    Yao Jiacheng
    Zhuo Li
    CHINESE JOURNAL OF ELECTRONICS, 2021, 30 (06) : 1096 - 1102