Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Cited by: 1
Authors
Qi, Shanshan [1 ]
Yang, Luxi [1 ]
Li, Chunguo [1 ]
Huang, Yongming [1 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing 211189, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Grounding; Visualization; Task analysis; Logic gates; Proposals; Convolution; Temporal sentence grounding; coarse-grained crucial frame selection; fine-grained spatial-temporal relationship matching; gated graph convolution network; LOCALIZATION; VIDEOS; ATTENTION; PROPOSAL; VLAD;
DOI
10.1109/ACCESS.2021.3095229
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Temporal sentence grounding aims to ground a query sentence to a specific segment of a video. Previous methods follow a common equally-spaced frame selection mechanism for appearance and motion modeling, which fails to account for redundant and distracting visual information and offers no guarantee that all meaningful frames are obtained. Moreover, this task requires detecting location clues precisely along both the spatial and temporal dimensions, yet the relationship between spatial-temporal semantic information and the query sentence remains unexplored in existing methods. Inspired by human thinking patterns, we propose a Coarse-to-Fine Spatial-Temporal Relationship Inference (CFSTRI) network that progressively localizes fine-grained activity segments. First, we present a coarse-grained crucial frame selection module, in which query-guided local difference context modeling over adjacent frames helps discriminate all coarse boundary locations relevant to the sentence semantics, and soft assignment vectors of locally aggregated descriptors are employed to enhance the representations of the selected frames. Then, we develop a fine-grained spatial-temporal relationship matching module to refine the coarse boundaries; it disentangles the spatial and temporal semantic information of the query sentence to guide the excavation of visual grounding clues along the corresponding dimensions. Furthermore, we devise a gated graph convolution network that incorporates the spatial-temporal semantic information, leveraging a gate operation to highlight the frames referred to by the query sentence along the spatial and temporal dimensions and propagating the fused information over the graph. Extensive experiments on two benchmark datasets demonstrate that our CFSTRI significantly outperforms most state-of-the-art methods.
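To make the gated graph convolution idea concrete, the following is a minimal PyTorch-style sketch of how such a layer could gate frame nodes with query semantics before propagating messages over a frame graph. All class names, variable names, shapes, and the exact gating formulation are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConv(nn.Module):
    """Hypothetical gated graph convolution: a query-conditioned gate
    suppresses query-irrelevant frame nodes before message passing."""
    def __init__(self, frame_dim, query_dim):
        super().__init__()
        self.gate = nn.Linear(frame_dim + query_dim, 1)  # per-frame relevance gate
        self.proj = nn.Linear(frame_dim, frame_dim)      # message transformation

    def forward(self, frames, query, adj):
        # frames: (B, T, Df) frame node features
        # query:  (B, Dq) sentence-level feature
        # adj:    (B, T, T) row-normalized frame affinity graph
        q = query.unsqueeze(1).expand(-1, frames.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([frames, q], dim=-1)))  # (B, T, 1)
        messages = torch.bmm(adj, self.proj(g * frames))  # propagate gated features
        return F.relu(frames + messages)                  # residual node update

# Toy usage with random tensors (all shapes are assumptions).
layer = GatedGraphConv(frame_dim=512, query_dim=256)
frames = torch.randn(2, 16, 512)                     # 2 videos, 16 selected frames
query = torch.randn(2, 256)
adj = torch.softmax(torch.randn(2, 16, 16), dim=-1)  # soft per-pair frame affinity
out = layer(frames, query, adj)                      # (2, 16, 512)

In the paper's pipeline such a layer would presumably operate on the frames kept by the coarse-grained selection stage; here it simply runs on random tensors to show the data flow.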
Pages: 97430-97443
Page count: 14