Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; IMAGE;
DOI
10.1109/TCSVT.2023.3257193
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808; 0809;
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing efforts learn a common semantic embedding space in which items of different modalities can be compared directly, pulling the global representations of positive video-text pairs close while pushing negative ones apart via a pair-wise ranking loss. However, such a vanilla loss unfortunately yields ambiguous feature embeddings for the texts of different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance video-text retrieval, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate the mixed boundaries. Specifically, the teacher model is tailored toward a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, into a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD achieves up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance under the R@1 metric, compared with 14 state-of-the-art baselines.
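The abstract describes two training signals: a multimodal contrastive objective that maximizes mutual information between co-occurring video-text pairs (the teacher), and a distillation term that transfers the teacher's inter-instance structure to a student. The sketch below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes CLIP-style global embeddings, uses a symmetric InfoNCE loss as the contrastive objective, and uses a temperature-softened KL term over cross-modal similarity matrices as a stand-in for the relational distillation. All function names, tensor shapes, and temperature values are illustrative assumptions.

```python
# Minimal sketch of (1) a symmetric InfoNCE contrastive loss over co-occurring
# video/text embeddings and (2) a relational distillation loss that matches the
# student's inter-instance similarity distribution to a frozen teacher's.
import torch
import torch.nn.functional as F


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) global representations of co-occurring pairs.
    Matching pairs lie on the diagonal of the similarity matrix; the remaining
    entries in each row/column act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                          # (B, B) scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # average the video-to-text and text-to-video directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


def relation_distill(student_sim: torch.Tensor, teacher_sim: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """KL divergence between teacher and student similarity distributions.

    student_sim, teacher_sim: (B, B) cross-modal similarity matrices for the same
    batch. Softening with a temperature and matching row-wise distributions
    transfers the teacher's inter-instance structure rather than hard labels.
    """
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau


if __name__ == "__main__":
    # Illustrative usage: random features stand in for encoder outputs.
    B, D = 8, 512
    v_s, t_s = torch.randn(B, D), torch.randn(B, D)      # student embeddings
    with torch.no_grad():
        v_t, t_t = torch.randn(B, D), torch.randn(B, D)  # frozen teacher embeddings
    sim_s = F.normalize(v_s, dim=-1) @ F.normalize(t_s, dim=-1).T
    sim_t = F.normalize(v_t, dim=-1) @ F.normalize(t_t, dim=-1).T
    loss = info_nce(v_s, t_s) + relation_distill(sim_s, sim_t)
    print(f"total loss: {loss.item():.4f}")
```

In a full training loop these two terms would be combined with the student's own retrieval loss (e.g., the explicit discrimination loss the abstract mentions); the random tensors above merely stand in for video- and text-encoder outputs.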
Pages: 5486-5497
Number of pages: 12
Related papers
50 records in total
  • [31] Spatiotemporal contrastive modeling for video moment retrieval
    Wang, Yi
    Li, Kun
    Chen, Guoliang
    Zhang, Yan
    Guo, Dan
    Wang, Meng
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2023, 26 (04): 1525 - 1544
  • [32] TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio
    Wang, Xin
    Meng, Benyuan
    Chen, Hong
    Meng, Yuan
    Lv, Ke
    Zhu, Wenwu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 2391 - 2399
  • [34] Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
    Luo, Kaiyi
    Zhang, Xulong
    Wang, Jianzong
    Li, Huaxiong
    Cheng, Ning
    Xiao, Jing
    2023 IEEE 35TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2023, : 913 - 917
  • [35] Multimodal Knowledge Distillation in Spectral Imaging
    Lopes, Tomas
    Capela, Diana
    Ferreira, Miguel F. S.
    Teixeira, Joana
    Silva, Catarina
    Guimaraes, Diana F.
    Jorge, Pedro A. S.
    Silva, Nuno A.
    OPTICAL SENSING AND DETECTION VIII, 2024, 12999
  • [36] Enhanced Text Classification using Proxy Labels and Knowledge Distillation
    Sukumaran, Rohan
    Prabhu, Sumanth
    Misra, Hemant
    PROCEEDINGS OF THE 5TH JOINT INTERNATIONAL CONFERENCE ON DATA SCIENCE & MANAGEMENT OF DATA, CODS COMAD 2022, 2022, : 227 - 230
  • [37] Identity-Aware Contrastive Knowledge Distillation for Facial Attribute Recognition
    Chen, Si
    Zhu, Xueyan
    Yan, Yan
    Zhu, Shunzhi
    Li, Shao-Zi
    Wang, Da-Han
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5692 - 5706
  • [38] Domain Knowledge Distillation and Supervised Contrastive Learning for Industrial Process Monitoring
    Ai, Mingxi
    Xie, Yongfang
    Ding, Steven X. X.
    Tang, Zhaohui
    Gui, Weihua
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2023, 70 (09) : 9452 - 9462
  • [39] Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
    Wang, Jinpeng
    Chen, Bin
    Liao, Dongliang
    Zeng, Ziyun
    Li, Gongfu
    Xia, Shu-Tao
    Xu, Jin
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 3020 - 3030
  • [40] Categorical Relation-Preserving Contrastive Knowledge Distillation for Medical Image Classification
    Xing, Xiaohan
    Hou, Yuenan
    Li, Hang
    Yuan, Yixuan
    Li, Hongsheng
    Meng, Max Q-H
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT V, 2021, 12905 : 163 - 173