Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment

Cited by: 0
Authors
Che, Zhanbin [1]
Guo, Huaili [1]
Affiliations
[1] Zhongyuan Univ Technol, Coll Comp, Zhengzhou 450007, Henan, Peoples R China
Keywords
Video-text alignment; cross-modal; contrastive learning; similarity measure; feature fusion
DOI
10.14569/IJACSA.2024.0150232
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Cross-modal video retrieval remains a major challenge in natural language processing because of the inherent semantic gap between video and text. Most approaches use a single encoder to extract video and text features separately and train on video-text pairs with contrastive learning, but this global alignment of video and text tends to neglect the finer-grained features of both. In addition, some studies focus only on analyzing the video description text and ignore its correlation with the video. This paper therefore proposes a video retrieval method based on video-text alignment that performs both global and fine-grained alignment between video and text. For global alignment, the video and text are each encoded by a single encoder and aligned after linear projection; for fine-grained alignment, some semantic information in the text is masked and the video encoder is trained to align the video with the text. In experimental comparisons with multiple existing methods, the model achieves R@1 (recall at 1) of 51.5% on the MSR-VTT dataset and 52.4% on the MSVD dataset, which indicates that the proposed model can improve the efficiency of cross-modal video retrieval.
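The global alignment and the R@1 metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so a symmetric InfoNCE-style contrastive loss over cosine similarities is assumed here, and the function names, the temperature value of 0.05, and the text-to-video retrieval direction are all hypothetical choices made for illustration. Embeddings are assumed to be already projected into a shared space.

```python
import math


def cosine_similarity_matrix(video_embs, text_embs):
    """Return S where S[i][j] is the cosine similarity of video i and text j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    v_n = [normalize(v) for v in video_embs]
    t_n = [normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) for t in t_n] for v in v_n]


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Assumed symmetric InfoNCE loss; matched pairs lie on the diagonal of sim."""
    n = len(sim)

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the video-to-text (rows) and text-to-video (columns) directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(columns))


def recall_at_k(sim, k=1):
    """Text-to-video recall@k: share of texts whose paired video ranks in the top k."""
    n = len(sim)
    hits = 0
    for j in range(n):
        scores = [sim[i][j] for i in range(n)]
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        if j in ranked[:k]:
            hits += 1
    return hits / n


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    videos = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.1], [0.1, 1.0]]
    sim = cosine_similarity_matrix(videos, texts)
    print("loss:", symmetric_contrastive_loss(sim))
    print("R@1:", recall_at_k(sim, k=1))
```

In this sketch, a well-aligned batch yields a loss close to zero and an R@1 of 1.0, while shuffling the texts relative to the videos raises the loss and lowers the recall. The masked-text fine-grained alignment is not sketched because the abstract gives too little detail to reconstruct it.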
Pages: 303-311
Page count: 9