Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval

Cited by: 3
Authors
Chen, Lei [1 ]
Deng, Zhen [2 ]
Liu, Libo [1 ]
Yin, Shibai [2 ]
Affiliations
[1] Ningxia Univ, Coll Informat Engn, Yinchuan 750021, Peoples R China
[2] Ningxia Univ, Coll Informat Engn, Yinchuan 611130, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Weak semantic data; video-text retrieval; cross-modal retrieval; cross-alignment; attention mechanism;
DOI
10.1109/TCSVT.2024.3360530
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808 ; 0809 ;
Abstract
Video-text cross-modal retrieval (VTR) is more natural and challenging than image-text retrieval and has attracted increasing interest from researchers in recent years. To align VTR more closely with real-world scenarios, i.e., weakly semantic text descriptions as queries, we propose a multilevel semantic interaction alignment (MSIA) model. We develop a two-stream network that decomposes video-text alignment into multiple dimensions. Specifically, in the video stream, to better align heterogeneous data, redundant video information is suppressed via a designed frame adaptation attention mechanism, and richer semantic interaction is achieved through a text-guided attention mechanism. Then, for text alignment within local video regions, we design a distinctive anchor frame strategy and a word selection method. Finally, a cross-granularity alignment approach is designed to learn more numerous and finer semantic features. With this scheme, the alignment between videos and weakly semantic text descriptions is reinforced, further mitigating the alignment difficulties caused by weak semantic text descriptions. Experimental results on VTR benchmark datasets show that our approach performs competitively with state-of-the-art methods. The code is available at: https://github.com/jiaranjintianchism/MSIA.
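The text-guided attention mechanism described above can be illustrated with a minimal sketch: text features score each video frame, and frames similar to the query receive more weight while redundant ones are suppressed. This is a generic dot-product cross-attention under assumed shapes, not the authors' implementation; the function name and scaling choice are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_frame_attention(frames, text):
    """Reweight frame embeddings by their similarity to a text query.

    frames: (n_frames, d) array of frame embeddings
    text:   (d,) pooled sentence embedding
    Returns (weights, video_repr): per-frame attention weights and the
    text-conditioned video representation.
    """
    d = frames.shape[1]
    scores = (frames @ text) / np.sqrt(d)   # (n_frames,) scaled similarities
    weights = softmax(scores)               # frames matching the text get more mass
    video_repr = weights @ frames           # (d,) weighted sum of frame features
    return weights, video_repr
```

In a full model the text embedding would come from a learned encoder (e.g., BERT) and projection layers would map both modalities into a shared space; here raw dot products stand in for that learned similarity.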
Pages: 6559-6575
Page count: 17