Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing approaches learn a common semantic embedding space in which items of different modalities can be compared directly: the global representations of positive video-text pairs are pulled together while those of negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for texts describing different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate mixed decision boundaries. Specifically, the teacher model is tailored to a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, into a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD improves text-to-video R@1 by up to 8.8%, 6.4%, 5.9%, and 5.3%, respectively, over 14 state-of-the-art baselines.
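For intuition, below is a minimal sketch of the two ingredients the abstract describes: a symmetric InfoNCE objective (mutual-information maximization over co-occurring video-text pairs, as in the contrastive teacher) and a relational distillation term that transfers the teacher's inter-instance similarity structure to a student. This is an illustrative reconstruction under assumed conventions (L2-normalized embeddings, temperature tau, a 0.5 distillation weight), not the paper's actual implementation.

```python
# Illustrative sketch, not the paper's code: (1) symmetric InfoNCE over
# paired video/text embeddings, (2) KL-based transfer of the teacher's
# inter-instance similarity distribution to the student.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of B paired embeddings.

    video_emb, text_emb: (B, D) L2-normalized embeddings; matched pairs
    share the same row index, so the diagonal holds the positives.
    """
    logits = video_emb @ text_emb.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Video-to-text and text-to-video cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def relation_distill(teacher_v, teacher_t, student_v, student_t, tau=0.07):
    """KL divergence between teacher and student inter-instance similarities."""
    t_sim = (teacher_v @ teacher_t.t() / tau).softmax(dim=-1).detach()
    s_sim = (student_v @ student_t.t() / tau).log_softmax(dim=-1)
    return F.kl_div(s_sim, t_sim, reduction="batchmean")

# Usage with random stand-in features (B=8 pairs, D=256 dims):
B, D = 8, 256
tv = F.normalize(torch.randn(B, D), dim=-1)  # teacher video embeddings
tt = F.normalize(torch.randn(B, D), dim=-1)  # teacher text embeddings
sv = F.normalize(torch.randn(B, D), dim=-1)  # student video embeddings
st = F.normalize(torch.randn(B, D), dim=-1)  # student text embeddings
loss = info_nce(sv, st) + 0.5 * relation_distill(tv, tt, sv, st)  # 0.5: assumed weight
```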
Pages: 5486-5497 (12 pages)
Related Papers
50 items in total
  • [21] Guided Graph Attention Learning for Video-Text Matching. Li, Kunpeng; Liu, Chang; Stopa, Mike; Amano, Jun; Fu, Yun. ACM Transactions on Multimedia Computing Communications and Applications, 2022, 18(2).
  • [22] Learning continuation: Integrating past knowledge for contrastive distillation. Zhang, Bowen; Qin, Jiaohua; Xiang, Xuyu; Tan, Yun. Knowledge-Based Systems, 2024, 304.
  • [23] Hybrid mix-up contrastive knowledge distillation. Zhang, Jian; Tao, Ze; Guo, Kehua; Li, Haowei; Zhang, Shichao. Information Sciences, 2024, 660.
  • [24] ItrievalKD: An Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval. Liu, Zhen; Zhu, Yongxin; Gao, Zhujin; Sheng, Xin; Xu, Linli. Advances in Knowledge Discovery and Data Mining, PAKDD 2023, Pt. III, 2023, 13937: 257-268.
  • [25] Ensemble Modeling with Contrastive Knowledge Distillation for Sequential Recommendation. Du, Hanwen; Yuan, Huanhuan; Zhao, Pengpeng; Zhuang, Fuzhen; Liu, Guanfeng; Zhao, Lei; Liu, Yanchi; Sheng, Victor S. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 58-67.
  • [26] Stable Knowledge Transfer for Contrastive Distillation. Tang, Qiankun. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, 2024: 4995-4999.
  • [27] Contrastive Knowledge Distillation Method Based on Feature Space Embedding. Ye, F.; Chen, B.; Lai, Y. Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2023, 51(5): 13-23.
  • [28] ABUS tumor segmentation via decouple contrastive knowledge distillation. Pan, Pan; Li, Yanfeng; Chen, Houjin; Sun, Jia; Li, Xiaoling; Cheng, Lin. Physics in Medicine and Biology, 2024, 69(1).
  • [29] Improved Vector Quantization for Dense Retrieval with Contrastive Distillation. O'Neill, James; Dutta, Sourav. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 2072-2076.
  • [30] Leukocyte Classification Using Multimodal Architecture Enhanced by Knowledge Distillation. Yang, Litao; Mehta, Deval; Mahapatra, Dwarikanath; Ge, Zongyuan. Medical Optical Imaging and Virtual Microscopy Image Analysis, MOVI 2022, 2022, 13578: 63-72.