Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment

Cited by: 0

Authors:

Che, Zhanbin [1]

Guo, Huaili [1]

Affiliations:

[1] Zhongyuan Univ Technol, Coll Comp, Zhengzhou 450007, Henan, Peoples R China

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2024 / Vol. 15 / No. 02

Keywords:

Video-text alignment; cross-modal; contrastive learning; similarity measure; feature fusion

DOI:

10.14569/IJACSA.2024.0150232

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Cross-modal video retrieval remains a major challenge in natural language processing because of the inherent semantic gap between video and text. Most approaches use a single encoder to extract video and text features separately and train on video-text pairs with contrastive learning, but this global alignment of video and text tends to neglect the fine-grained features of both. In addition, some studies focus only on analyzing the video description text and ignore its correlation with the video. This paper therefore proposes a video retrieval method based on video-text alignment that realizes both global and fine-grained alignment between video and text. For global alignment, the video and text are aligned by a single encoder followed by linear projection; for fine-grained alignment, the video encoder is trained to align the video and text by masking some semantic information in the text. In experimental comparisons with multiple existing methods, the model achieves R@1 (recall at 1) of 51.5% and 52.4% on the MSR-VTT and MSVD datasets, respectively, which indicates that the proposed model can improve the efficiency of cross-modal video retrieval.
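The abstract names two ingredients that a small example can make concrete: contrastive training on video-text pairs (global alignment) and the R@1 retrieval metric. The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the cosine similarity measure, the symmetric cross-entropy form of the loss, and the temperature value of 0.05 are all assumptions for illustration; the abstract does not specify them.

```python
import math


def cosine_similarity_matrix(text_embs, video_embs):
    """Return sim[i][j] = cosine similarity between text i and video j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    texts = [normalize(t) for t in text_embs]
    videos = [normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Average text-to-video and video-to-text cross-entropy.

    Assumes matched pairs share an index, so targets lie on the diagonal.
    The temperature is a hypothetical value, not taken from the paper.
    """
    def cross_entropy_rows(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_sum - logits[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


def recall_at_k(sim, k=1):
    """Fraction of text queries whose matching video ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy data: three text queries and three videos in a 2-D embedding space.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embs = [[1.0, 0.1], [0.1, 1.0], [1.0, -1.0]]
sim = cosine_similarity_matrix(text_embs, video_embs)
r_at_1 = recall_at_k(sim, k=1)
loss = symmetric_contrastive_loss(sim)
```

In this toy setup the third text query is deliberately mismatched with its video, so it counts as a miss at k=1. Reported figures such as the R@1 values in the abstract are this same fraction computed over a full test set and expressed as a percentage.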


Pages: 303-311

Page count: 9
