Query-aware video encoder for video moment retrieval

Cited by: 10
Authors
Hao, Jiachang [1 ]
Sun, Haifeng [1 ]
Ren, Pengfei [1 ]
Wang, Jingyu [1 ]
Qi, Qi [1 ]
Liao, Jianxin [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video moment retrieval; Temporal sentence grounding; Video and language; LOCALIZATION; LANGUAGE;
DOI
10.1016/j.neucom.2022.01.085
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given an untrimmed video and a sentence query, video moment retrieval aims to locate the video moment that semantically corresponds to the query. It is a challenging task that requires a joint understanding of natural language queries and video content. However, videos contain complex content, both query-related and query-irrelevant, which makes this joint understanding difficult. To this end, we propose a query-aware video encoder that captures query-related visual content. Specifically, we design a query-guided block following each encoder layer to recalibrate the encoded visual features according to the query semantics. The core of the query-guided block is a channel-level attention gating mechanism, which selectively emphasizes query-related visual content and suppresses query-irrelevant content. In addition, to match the different levels of content in videos, we learn hierarchical and structural query clues to guide the capture of visual content. We disentangle the sentence query into a semantics graph and capture the local contexts inside the graph via a trilinear model as query clues. Extensive experiments on the Charades-STA and TACoS datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art results on both datasets. (c) 2022 Elsevier B.V. All rights reserved.
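The channel-level attention gating described in the abstract can be illustrated with a minimal sketch: a query clue is projected to per-channel weights in (0, 1), which rescale the encoded clip features. All function names, shapes, and the single linear projection here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def query_guided_gate(visual, query, w):
    """Recalibrate visual features channel-wise from query semantics.

    visual: (T, C) encoded clip features; query: (D,) query clue vector;
    w: (C, D) hypothetical projection from query space to channel gates.
    """
    gate = sigmoid(w @ query)       # (C,) per-channel weights in (0, 1)
    return visual * gate[None, :]   # emphasize query-related channels, damp the rest

# Toy usage with random features
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 4))   # 8 clips, 4 channels
query = rng.standard_normal(6)
w = rng.standard_normal((4, 6))
gated = query_guided_gate(visual, query, w)
```

Because the gate lies strictly in (0, 1), the block can only attenuate channels relative to the input, never amplify them, which matches the stated goal of suppressing query-irrelevant content while preserving query-related content.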
Pages: 72-86
Page count: 15