Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing efforts learn a common semantic embedding space in which items of different modalities can be directly compared: the global representations of positive video-text pairs are pulled close while negative ones are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, which adaptively uses the general knowledge of a self-supervised teacher model to calibrate the mixed decision boundaries. Specifically, the teacher model learns a robust (less ambiguous) joint visual-text semantic space by maximizing the mutual information between co-occurring modalities during multimodal contrastive learning. This robust, structured inter-instance knowledge is then distilled into a student model via an explicit discrimination loss to improve matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD improves text-to-video R@1 by up to 8.8%, 6.4%, 5.9%, and 5.3%, respectively, compared with 14 state-of-the-art baselines.
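The record does not include code, so the following is a minimal PyTorch sketch of the two ingredients the abstract describes: a symmetric InfoNCE-style contrastive objective (maximizing mutual information between co-occurring video and text embeddings, as in the teacher's multimodal contrastive learning) and an inter-instance distillation term that transfers the teacher's cross-modal similarity structure to the student. Function names, the KL-divergence formulation, and the temperature values are illustrative assumptions, not MCKD's exact losses.

```python
import torch
import torch.nn.functional as F

def infonce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of co-occurring video-text pairs.

    Maximizes a lower bound on the mutual information between the two
    modalities: each video's positive is its own caption, and all other
    captions in the batch serve as negatives (and vice versa).
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def inter_instance_distillation(teacher_v, teacher_t, student_v, student_t,
                                temperature=4.0):
    """One plausible distillation term (an assumption, not the paper's loss):
    KL divergence between the teacher's and student's softened cross-modal
    similarity distributions, so the student inherits the teacher's less
    ambiguous inter-instance structure."""
    t_sim = F.normalize(teacher_v, dim=-1) @ F.normalize(teacher_t, dim=-1).t()
    s_sim = F.normalize(student_v, dim=-1) @ F.normalize(student_t, dim=-1).t()
    t_prob = F.softmax(t_sim / temperature, dim=-1)
    s_logp = F.log_softmax(s_sim / temperature, dim=-1)
    # Standard T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

# Usage sketch with random embeddings standing in for encoder outputs:
B, D = 32, 512
v, t = torch.randn(B, D), torch.randn(B, D)
loss = infonce_loss(v, t) + inter_instance_distillation(v, t, v, t)
```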
Pages: 5486-5497
Page count: 12
Related Papers
50 items total
  • [41] Dual Encoding for Video Retrieval by Text
    Dong, Jianfeng
    Li, Xirong
    Xu, Chaoxi
    Yang, Xun
    Yang, Gang
    Wang, Xun
    Wang, Meng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (08) : 4065 - 4080
  • [42] Knowledge Distillation for Single Image Super-Resolution via Contrastive Learning
    Liu, Cencen
    Zhang, Dongyang
    Qin, Ke
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 1079 - 1083
  • [43] Text-Centric Multimodal Contrastive Learning for Sentiment Analysis
    Peng, Heng
    Gu, Xue
    Li, Jian
    Wang, Zhaodan
    Xu, Hao
    ELECTRONICS, 2024, 13 (06)
  • [44] MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval
    Feng, Duoduo
    He, Xiangteng
    Peng, Yuxin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
  • [45] Multimodal Learning with Incomplete Modalities by Knowledge Distillation
    Wang, Qi
    Zhan, Liang
    Thompson, Paul
    Zhou, Jiayu
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 1828 - 1838
  • [46] Multi-views contrastive learning for dense text retrieval
    Yu, Yang
    Zeng, Jun
    Zhong, Lin
    Gao, Min
    Wen, Junhao
    Wu, Yingbo
    KNOWLEDGE-BASED SYSTEMS, 2023, 274
  • [47] AdaCLIP: Towards Pragmatic Multimodal Video Retrieval
    Hu, Zhiming
    Ye, Angela Ning
    Khorasgani, Salar Hosseini
    Mohomed, Iqbal
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5623 - 5633
  • [48] Video Summarization Using Knowledge Distillation-Based Attentive Network
    Qin, Jialin
    Yu, Hui
    Liang, Wei
    Ding, Derui
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1022 - 1031
  • [49] Knowledge Distillation Hashing for Occluded Face Retrieval
    Yang, Yuxiang
    Tian, Xing
    Ng, Wing W. Y.
    Gao, Ying
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9096 - 9107
  • [50] Knowledge Distillation and Contrastive Learning for Detecting Visible-Infrared Transmission Lines Using Separated Stagger Registration Network
    Zhou, Wujie
    Wang, Yusen
    Qian, Xiaohong
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2025