Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Cited: 25
Authors
Wang, Yunxiao [1]
Liu, Meng [2]
Wei, Yinwei [3]
Cheng, Zhiyong [4]
Wang, Yinglong [4]
Nie, Liqiang [1]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Jinan 250100, Peoples R China
[2] Shandong Jianzhu Univ, Sch Comp Sci & Technol, Jinan 250101, Peoples R China
[3] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[4] Qilu Univ Technol, Shandong Acad Sci, Shandong Artificial Intelligence Inst, Jinan 250316, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Location awareness; Annotations; Visualization; Task analysis; Proposals; Neural networks; Multiple instance learning; siamese alignment network; vision-language alignment; weakly-supervised video moment retrieval;
DOI
10.1109/TMM.2022.3168424
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Video moment retrieval, i.e., localizing the specific moment within a video that matches a given description query, has attracted substantial attention over the past several years. Although great progress has been achieved thus far, most existing methods are fully supervised and require moment-level temporal annotations. In contrast, weakly-supervised methods, which need only video-level annotations, remain largely unexplored. In this paper, we propose a novel end-to-end Siamese alignment network for weakly-supervised video moment retrieval. Specifically, we design a multi-scale Siamese module that progressively reduces the semantic gap between the visual and textual modalities through its Siamese structure. In addition, we present a context-aware multiple instance learning module that considers the influence of adjacent contexts, enhancing moment-query and video-query alignment simultaneously. By promoting matching at both the moment level and the video level, our model effectively improves retrieval performance even with only weak video-level annotations. Extensive experiments on two benchmark datasets, ActivityNet-Captions and Charades-STA, verify the superiority of our model over several state-of-the-art baselines.
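The abstract outlines the general recipe rather than the exact layers, so the following is only a minimal PyTorch sketch of the weakly-supervised, MIL-based alignment idea it describes: candidate moments are scored against the query in a shared embedding space, the moment scores are aggregated into a single video-level score, and training uses only matched vs. mismatched video-query pairs. All module names, feature dimensions, and the aggregation and loss choices are illustrative assumptions, not the authors' actual architecture (which additionally includes a multi-scale design and context-aware aggregation not modeled here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklySupervisedMomentScorer(nn.Module):
    """Twin ("Siamese"-style) projections into a shared space, plus a
    MIL-style aggregation of moment scores into one video-level score.
    Dimensions below are illustrative assumptions."""

    def __init__(self, vis_dim=1024, txt_dim=300, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)  # visual branch
        self.txt_proj = nn.Linear(txt_dim, hid_dim)  # textual branch

    def forward(self, moment_feats, query_feat):
        # moment_feats: (B, M, vis_dim) -- features of M candidate moments
        # query_feat:   (B, txt_dim)    -- pooled sentence feature
        v = F.normalize(self.vis_proj(moment_feats), dim=-1)  # (B, M, H)
        q = F.normalize(self.txt_proj(query_feat), dim=-1)    # (B, H)
        moment_scores = torch.einsum("bmh,bh->bm", v, q)      # cosine per moment
        # MIL aggregation: attention-weighted pooling turns the bag of
        # moment scores into one differentiable video-level matching score.
        attn = moment_scores.softmax(dim=-1)
        video_score = (attn * moment_scores).sum(dim=-1)      # (B,)
        return moment_scores, video_score


def video_level_mil_loss(model, moment_feats, query_feat, margin=0.2):
    """Margin ranking loss using only video-level (weak) supervision:
    matched video-query pairs should outscore mismatched pairs, which
    are built here by rolling the queries within the batch."""
    _, pos = model(moment_feats, query_feat)
    _, neg = model(moment_feats, query_feat.roll(shifts=1, dims=0))
    return F.relu(margin - pos + neg).mean()
```

At inference time, under this sketch, the candidate with the highest entry in `moment_scores` would be returned as the retrieved moment; no moment-level temporal annotations are needed during training, which is the essence of the weakly-supervised setting.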
Pages: 3921-3933
Number of pages: 13