Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Cited: 25
Authors
Wang, Yunxiao [1]
Liu, Meng [2]
Wei, Yinwei [3]
Cheng, Zhiyong [4]
Wang, Yinglong [4]
Nie, Liqiang [1]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Jinan 250100, Peoples R China
[2] Shandong Jianzhu Univ, Sch Comp Sci & Technol, Jinan 250101, Peoples R China
[3] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[4] Qilu Univ Technol, Shandong Acad Sci, Shandong Artificial Intelligence Inst, Jinan 250316, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Location awareness; Annotations; Visualization; Task analysis; Proposals; Neural networks; Multiple instance learning; siamese alignment network; vision-language alignment; weakly-supervised video moment retrieval;
DOI
10.1109/TMM.2022.3168424
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Video moment retrieval, i.e., localizing the specific moment within a video that matches a given description query, has attracted substantial attention over the past several years. Although great progress has been achieved thus far, most existing methods are fully supervised and require moment-level temporal annotations. In contrast, weakly-supervised methods, which need only video-level annotations, remain largely unexplored. In this paper, we propose a novel end-to-end Siamese alignment network for weakly-supervised video moment retrieval. Specifically, we design a multi-scale Siamese module that progressively reduces the semantic gap between the visual and textual modalities through its Siamese structure. In addition, we present a context-aware multiple instance learning module that considers the influence of adjacent contexts, enhancing moment-query and video-query alignment simultaneously. By promoting matching at both the moment level and the video level, our model effectively improves retrieval performance even with only weak video-level annotations. Extensive experiments on two benchmark datasets, ActivityNet-Captions and Charades-STA, verify the superiority of our model over several state-of-the-art baselines.
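The abstract outlines the general recipe rather than the exact layers, so the following is only a minimal PyTorch sketch of the weakly-supervised, MIL-based alignment idea it describes: candidate moments are scored against the query in a shared embedding space, the moment scores are aggregated into a single video-level score, and training uses only matched vs. mismatched video-query pairs. All module names, feature dimensions, and the aggregation and loss choices are illustrative assumptions, not the authors' actual architecture (which additionally includes a multi-scale design and context-aware aggregation not modeled here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklySupervisedMomentScorer(nn.Module):
    """Twin ("Siamese"-style) projections into a shared space, plus a
    MIL-style aggregation of moment scores into one video-level score.
    Dimensions below are illustrative assumptions."""

    def __init__(self, vis_dim=1024, txt_dim=300, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)  # visual branch
        self.txt_proj = nn.Linear(txt_dim, hid_dim)  # textual branch

    def forward(self, moment_feats, query_feat):
        # moment_feats: (B, M, vis_dim) -- features of M candidate moments
        # query_feat:   (B, txt_dim)    -- pooled sentence feature
        v = F.normalize(self.vis_proj(moment_feats), dim=-1)  # (B, M, H)
        q = F.normalize(self.txt_proj(query_feat), dim=-1)    # (B, H)
        moment_scores = torch.einsum("bmh,bh->bm", v, q)      # cosine per moment
        # MIL aggregation: attention-weighted pooling turns the bag of
        # moment scores into one differentiable video-level matching score.
        attn = moment_scores.softmax(dim=-1)
        video_score = (attn * moment_scores).sum(dim=-1)      # (B,)
        return moment_scores, video_score


def video_level_mil_loss(model, moment_feats, query_feat, margin=0.2):
    """Margin ranking loss using only video-level (weak) supervision:
    matched video-query pairs should outscore mismatched pairs, which
    are built here by rolling the queries within the batch."""
    _, pos = model(moment_feats, query_feat)
    _, neg = model(moment_feats, query_feat.roll(shifts=1, dims=0))
    return F.relu(margin - pos + neg).mean()
```

At inference time, under this sketch, the candidate with the highest entry in `moment_scores` would be returned as the retrieved moment; no moment-level temporal annotations are needed during training, which is the essence of the weakly-supervised setting.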
Pages: 3921-3933
Number of pages: 13