Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Cited by: 15
Authors
Ma, Wentao [1 ]
Chen, Qingchao [2 ]
Zhou, Tongqing [1 ]
Zhao, Shan [3 ]
Cai, Zhiping [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci, Changsha 410073, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100091, Peoples R China
[3] Hefei Univ Technol, Sch Comp & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; contrastive learning; knowledge distillation; image
DOI
10.1109/TCSVT.2023.3257193
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Cross-modal retrieval aims to enable flexible bi-directional retrieval across different modalities (e.g., searching for videos with text queries). Many existing approaches learn a common semantic embedding space in which items of different modalities can be compared directly: the global representations of positive video-text pairs are pulled together while those of negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss yields ambiguous feature embeddings for texts describing different videos, causing inaccurate cross-modal matching and unreliable retrieval. To this end, we propose MCKD, a multimodal contrastive knowledge distillation method for instance-level video-text retrieval, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate mixed decision boundaries. Specifically, the teacher model is tailored to a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust, structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, into a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) show that MCKD improves text-to-video R@1 by up to 8.8%, 6.4%, 5.9%, and 5.3%, respectively, over 14 state-of-the-art baselines.
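For intuition, below is a minimal sketch of the two ingredients the abstract describes: a symmetric InfoNCE objective (mutual-information maximization over co-occurring video-text pairs, as in the contrastive teacher) and a relational distillation term that transfers the teacher's inter-instance similarity structure to a student. This is an illustrative reconstruction under assumed conventions (L2-normalized embeddings, temperature tau, a 0.5 distillation weight), not the paper's actual implementation.

```python
# Illustrative sketch, not the paper's code: (1) symmetric InfoNCE over
# paired video/text embeddings, (2) KL-based transfer of the teacher's
# inter-instance similarity distribution to the student.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of B paired embeddings.

    video_emb, text_emb: (B, D) L2-normalized embeddings; matched pairs
    share the same row index, so the diagonal holds the positives.
    """
    logits = video_emb @ text_emb.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Video-to-text and text-to-video cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def relation_distill(teacher_v, teacher_t, student_v, student_t, tau=0.07):
    """KL divergence between teacher and student inter-instance similarities."""
    t_sim = (teacher_v @ teacher_t.t() / tau).softmax(dim=-1).detach()
    s_sim = (student_v @ student_t.t() / tau).log_softmax(dim=-1)
    return F.kl_div(s_sim, t_sim, reduction="batchmean")

# Usage with random stand-in features (B=8 pairs, D=256 dims):
B, D = 8, 256
tv = F.normalize(torch.randn(B, D), dim=-1)  # teacher video embeddings
tt = F.normalize(torch.randn(B, D), dim=-1)  # teacher text embeddings
sv = F.normalize(torch.randn(B, D), dim=-1)  # student video embeddings
st = F.normalize(torch.randn(B, D), dim=-1)  # student text embeddings
loss = info_nce(sv, st) + 0.5 * relation_distill(tv, tt, sv, st)  # 0.5: assumed weight
```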
Pages: 5486-5497 (12 pages)
Related Papers
50 items in total
  • [21] Guided Graph Attention Learning for Video-Text Matching. Li, Kunpeng; Liu, Chang; Stopa, Mike; Amano, Jun; Fu, Yun. ACM Transactions on Multimedia Computing Communications and Applications, 2022, 18(2).
  • [22] Learning continuation: Integrating past knowledge for contrastive distillation. Zhang, Bowen; Qin, Jiaohua; Xiang, Xuyu; Tan, Yun. Knowledge-Based Systems, 2024, 304.
  • [23] Hybrid mix-up contrastive knowledge distillation. Zhang, Jian; Tao, Ze; Guo, Kehua; Li, Haowei; Zhang, Shichao. Information Sciences, 2024, 660.
  • [24] ItrievalKD: An Iterative Retrieval Framework Assisted with Knowledge Distillation for Noisy Text-to-Image Retrieval. Liu, Zhen; Zhu, Yongxin; Gao, Zhujin; Sheng, Xin; Xu, Linli. Advances in Knowledge Discovery and Data Mining, PAKDD 2023, Pt. III, 2023, 13937: 257-268.
  • [25] Ensemble Modeling with Contrastive Knowledge Distillation for Sequential Recommendation. Du, Hanwen; Yuan, Huanhuan; Zhao, Pengpeng; Zhuang, Fuzhen; Liu, Guanfeng; Zhao, Lei; Liu, Yanchi; Sheng, Victor S. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 58-67.
  • [26] Stable Knowledge Transfer for Contrastive Distillation. Tang, Qiankun. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, 2024: 4995-4999.
  • [27] Contrastive Knowledge Distillation Method Based on Feature Space Embedding. Ye, F.; Chen, B.; Lai, Y. Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2023, 51(5): 13-23.
  • [28] ABUS tumor segmentation via decouple contrastive knowledge distillation. Pan, Pan; Li, Yanfeng; Chen, Houjin; Sun, Jia; Li, Xiaoling; Cheng, Lin. Physics in Medicine and Biology, 2024, 69(1).
  • [29] Improved Vector Quantization for Dense Retrieval with Contrastive Distillation. O'Neill, James; Dutta, Sourav. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 2072-2076.
  • [30] Leukocyte Classification Using Multimodal Architecture Enhanced by Knowledge Distillation. Yang, Litao; Mehta, Deval; Mahapatra, Dwarikanath; Ge, Zongyuan. Medical Optical Imaging and Virtual Microscopy Image Analysis, MOVI 2022, 2022, 13578: 63-72.