Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

Cited by: 24
Authors
Zhang, Zongmeng [1]
Han, Xianjing [1]
Song, Xuemeng [1]
Yan, Yan [2]
Nie, Liqiang [1]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
[2] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Natural Science Foundation of China
Keywords
Videos; Location awareness; Task analysis; Semantics; Syntactics; Convolution; Cognition; Temporal language localization; graph convolutional network; video and language; NEURAL-NETWORK;
DOI
10.1109/TIP.2021.3113791
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper tackles the problem of temporal language localization in videos, which aims to identify the start and end points of the moment described by a natural language sentence in an untrimmed video. The task is non-trivial, since it requires not only a comprehensive understanding of both the video and the sentence query, but also an accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relations among video clips and query words to reason over the video and the sentence query, neglecting other intra-modal relations (e.g., the semantic similarity among video clips and the syntactic dependency among the query words). Towards this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and the sentence query, so as to facilitate their understanding and the capture of their semantic correspondence. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments, and multi-scale fully connected layers are designed to rank the generated coarse candidate moments of different lengths and adjust their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
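The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch sketch of the core idea it describes: a single graph-convolution layer over a joint graph whose nodes are video clips and query words, with edge weights mixing intra-modal relations (clip-clip similarity, word-word syntactic dependency) and inter-modal attention. All names (`MultiModalInteractionGCN`, `word_dep`, etc.) are illustrative assumptions, not the authors' code; the paper's exact edge definitions, normalization, and localization head are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalInteractionGCN(nn.Module):
    """Hypothetical sketch of one MIGCN-style layer: a graph convolution over
    a joint clip-word graph whose edges combine intra-modal relations with
    inter-modal attention (assumed formulation, not the authors' exact one)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clips, words, word_dep):
        # clips: (n_c, d) video-clip features; words: (n_w, d) word features
        # word_dep: (n_w, n_w) float syntactic-dependency adjacency, assumed
        # to come from an external dependency parser
        nodes = torch.cat([clips, words], dim=0)              # (n_c + n_w, d)
        sim = nodes @ nodes.t()                               # pairwise affinities
        n_c = clips.size(0)
        adj = torch.zeros_like(sim)
        adj[:n_c, :n_c] = F.softmax(sim[:n_c, :n_c], dim=-1)  # clip-clip similarity
        adj[n_c:, n_c:] = word_dep                            # word-word dependency
        adj[:n_c, n_c:] = F.softmax(sim[:n_c, n_c:], dim=-1)  # clip-to-word attention
        adj[n_c:, :n_c] = F.softmax(sim[n_c:, :n_c], dim=-1)  # word-to-clip attention
        adj = adj + torch.eye(adj.size(0), device=adj.device) # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        out = F.relu(self.proj((adj / deg) @ nodes))          # row-normalized GCN step
        return out[:n_c], out[n_c:]                           # updated clip/word features
```

In the full model, several such layers would presumably be stacked, with the updated clip features then passed to the context-aware localization head, whose multi-scale fully connected layers score candidate moments of different lengths and regress their boundary offsets.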
Pages: 8265-8277
Page count: 13