Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval

Cited by: 3
Authors
Nian, Fudong [1 ,2 ]
Ding, Ling [1 ]
Hu, Yuxia [2 ]
Gu, Yanhong [1 ]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei 230601, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Int Joint Res Ctr Ancient Architecture Inte, Hefei 230601, Peoples R China
Keywords
video-text retrieval; multi-level space learning; cross-modal similarity calculation; IMAGE;
DOI
10.3390/math10183346
Chinese Library Classification
O1 [Mathematics]
Discipline Classification Code
0701; 070101
Abstract
This paper strives to improve the performance of video-text retrieval. To date, many algorithms have been proposed to facilitate the similarity measure of video-text retrieval, progressing from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the semantic levels they model are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space solely through feature-distance measures is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video-text retrieval that jointly models video-text similarity on the global, entity, action, and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action, and relationship semantic levels by carefully designed spatial-temporal semantic learning structures. Then, KLDivLoss and a cross-modal parameter-shared attribute projection layer are used as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute-distribution problem for video-text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video-text retrieval datasets, MSR-VTT and VATEX, demonstrate the viability of the method.
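The statistical alignment constraint named above (KLDivLoss combined with a parameter-shared attribute projection layer) can be illustrated with a minimal PyTorch sketch. The feature and attribute dimensions, and the use of softmax attribute distributions, are illustrative assumptions; this record does not specify the paper's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedAttributeProjection(nn.Module):
        """Parameter-shared projection from modality features to attribute scores.

        The same linear weights map the video and text features of one semantic
        level to a distribution over that level's attribute vocabulary; a
        KL-divergence term then pulls the two predicted distributions together.
        Dimensions are illustrative assumptions, not the paper's configuration.
        """

        def __init__(self, feat_dim=512, num_attributes=1000):
            super().__init__()
            # One projection shared by both modalities at this semantic level.
            self.proj = nn.Linear(feat_dim, num_attributes)

        def forward(self, video_feat, text_feat):
            v_logits = self.proj(video_feat)
            t_logits = self.proj(text_feat)
            # PyTorch's KL divergence expects log-probabilities as input and
            # probabilities as target. In practice the target side might be
            # detached or the loss symmetrized; that choice is an assumption here.
            return F.kl_div(F.log_softmax(v_logits, dim=-1),
                            F.softmax(t_logits, dim=-1),
                            reduction="batchmean")
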
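The abstract names a focal binary cross-entropy (FBCE) loss for the imbalanced attribute-label distribution but does not give its exact form in this record. A plausible minimal sketch, applying the standard focal-loss idea of Lin et al. to per-attribute binary cross-entropy, might look as follows; the gamma and alpha values are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
        """Focal binary cross-entropy over multi-label attribute predictions.

        Down-weights easy (well-classified) attribute labels so that rare, hard
        attributes dominate the gradient. `targets` is a float multi-hot tensor
        of 0/1 attribute labels; gamma and alpha are illustrative defaults.
        """
        probs = torch.sigmoid(logits)
        # Per-element BCE without reduction.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        # p_t is the model's probability assigned to the true label.
        p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
        # alpha balances positive vs. negative labels; (1 - p_t)^gamma focuses
        # the loss on hard, misclassified attributes.
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
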
Pages: 19