Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing methods learn a common semantic embedding space in which items from different modalities can be compared directly: the global representations of positive video-text pairs are pulled together while negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for the texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To address this, we propose a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, called MCKD, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate the mixed boundaries in the embedding space. Specifically, the teacher model is tailored to a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information between co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled to a student model, with the help of an explicit discrimination loss, to improve matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD achieves improvements of up to 8.8%, 6.4%, 5.9%, and 5.3% in text-to-video retrieval at R@1, compared with 14 state-of-the-art baselines.
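To make the objective described in the abstract concrete, below is a minimal PyTorch-style sketch of the two ingredients it names: a symmetric InfoNCE-style contrastive loss that maximizes mutual information between co-occurring video-text pairs (the teacher), and an inter-instance similarity distillation loss combined with a discrimination loss for the student. This is an illustrative sketch under assumptions, not the authors' implementation; the function names, the temperature tau, and the weight lambda_kd are hypothetical.

import torch
import torch.nn.functional as F

def contrastive_teacher_loss(video_emb, text_emb, tau=0.07):
    # Symmetric InfoNCE: the i-th video and i-th text in the batch form a
    # positive pair; all other pairings in the batch act as negatives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau                      # (B, B) scaled cosine similarities
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def inter_instance_distillation_loss(student_sim, teacher_sim, tau=0.07):
    # Match the student's row-wise similarity distribution to the teacher's
    # (KL divergence), transferring the teacher's inter-instance structure.
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def student_loss(student_v, student_t, teacher_v, teacher_t,
                 tau=0.07, lambda_kd=1.0):
    # Discrimination loss on the student's own embeddings plus distillation
    # of the (frozen) teacher's pairwise similarity matrix.
    sv, st = F.normalize(student_v, dim=-1), F.normalize(student_t, dim=-1)
    with torch.no_grad():
        tv, tt = F.normalize(teacher_v, dim=-1), F.normalize(teacher_t, dim=-1)
        teacher_sim = tv @ tt.t()
    student_sim = sv @ st.t()
    discrimination = contrastive_teacher_loss(student_v, student_t, tau)
    distill = inter_instance_distillation_loss(student_sim, teacher_sim, tau)
    return discrimination + lambda_kd * distill

In this sketch the teacher's pairwise similarity matrix plays the role of the "robust and structural inter-instance knowledge"; the paper's actual losses, architectures, and hyperparameters may differ.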
Pages: 5486-5497
Number of pages: 12
Related Papers
50 records in total
  • [1] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
    Liu, Zhi
    Zhao, Fangyuan
    Zhang, Mengmeng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
  • [2] Joint embeddings with multimodal cues for video-text retrieval
    Mithun, Niluthpol C.
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
  • [3] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [4] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [5] MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
    Shu, Fangxun
    Chen, Biaolong
    Liao, Yue
    Wang, Jinqiao
    Liu, Si
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9962 - 9972
  • [6] SPSD: Similarity-preserving self-distillation for video-text retrieval
    Wang, Jiachen
    Hua, Yan
    Yang, Yingyun
    Kou, Hongwei
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (02)
  • [7] CLIP2TF: Multimodal video-text retrieval for adolescent education
    Sun, Xiaoning
    Fan, Tao
    Li, Hongxu
    Wang, Guozhong
    Ge, Peien
    Shang, Xiwu
    DISPLAYS, 2024, 84
  • [8] KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
    Zhuang, Xianwei
    Li, Hongxiang
    Cheng, Xuxin
    Zhu, Zhihong
    Xie, Yuxin
    Zou, Yuexian
    COMPUTER VISION - ECCV 2024, PT XXXIV, 2025, 15092 : 313 - 331
  • [9] Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1438 - 1453