Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval

Cited by: 3
Authors
Nian, Fudong [1 ,2 ]
Ding, Ling [1 ]
Hu, Yuxia [2 ]
Gu, Yanhong [1 ]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei 230601, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Int Joint Res Ctr Ancient Architecture Inte, Hefei 230601, Peoples R China
Keywords
video-text retrieval; multi-level space learning; cross-modal similarity calculation; IMAGE;
DOI
10.3390/math10183346
Chinese Library Classification
O1 [Mathematics]
Discipline Classification Code
0701; 070101
Abstract
This paper strives to improve the performance of video-text retrieval. To date, many algorithms have been proposed to facilitate the similarity measure of video-text retrieval, progressing from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the semantic levels they model are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space solely through feature-distance measures is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video-text retrieval that jointly models video-text similarity on the global, entity, action, and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action, and relationship semantic levels by carefully designed spatial-temporal semantic learning structures. Then, KLDivLoss and a cross-modal parameter-shared attribute projection layer are used as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute-distribution problem for video-text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video-text retrieval datasets, MSR-VTT and VATEX, demonstrate the viability of the method.
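The statistical alignment constraint named above (KLDivLoss combined with a parameter-shared attribute projection layer) can be illustrated with a minimal PyTorch sketch. The feature and attribute dimensions, and the use of softmax attribute distributions, are illustrative assumptions; this record does not specify the paper's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedAttributeProjection(nn.Module):
        """Parameter-shared projection from modality features to attribute scores.

        The same linear weights map the video and text features of one semantic
        level to a distribution over that level's attribute vocabulary; a
        KL-divergence term then pulls the two predicted distributions together.
        Dimensions are illustrative assumptions, not the paper's configuration.
        """

        def __init__(self, feat_dim=512, num_attributes=1000):
            super().__init__()
            # One projection shared by both modalities at this semantic level.
            self.proj = nn.Linear(feat_dim, num_attributes)

        def forward(self, video_feat, text_feat):
            v_logits = self.proj(video_feat)
            t_logits = self.proj(text_feat)
            # PyTorch's KL divergence expects log-probabilities as input and
            # probabilities as target. In practice the target side might be
            # detached or the loss symmetrized; that choice is an assumption here.
            return F.kl_div(F.log_softmax(v_logits, dim=-1),
                            F.softmax(t_logits, dim=-1),
                            reduction="batchmean")
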
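The abstract names a focal binary cross-entropy (FBCE) loss for the imbalanced attribute-label distribution but does not give its exact form in this record. A plausible minimal sketch, applying the standard focal-loss idea of Lin et al. to per-attribute binary cross-entropy, might look as follows; the gamma and alpha values are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
        """Focal binary cross-entropy over multi-label attribute predictions.

        Down-weights easy (well-classified) attribute labels so that rare, hard
        attributes dominate the gradient. `targets` is a float multi-hot tensor
        of 0/1 attribute labels; gamma and alpha are illustrative defaults.
        """
        probs = torch.sigmoid(logits)
        # Per-element BCE without reduction.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        # p_t is the model's probability assigned to the true label.
        p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
        # alpha balances positive vs. negative labels; (1 - p_t)^gamma focuses
        # the loss on hard, misclassified attributes.
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
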
Pages: 19