Videos as Space-Time Region Graphs

Cited by: 487
Authors
Wang, Xiaolong [1]
Gupta, Abhinav [1]
Affiliations
[1] Carnegie Mellon Univ, Robot Inst, Pittsburgh, PA 15213 USA
Source
COMPUTER VISION - ECCV 2018, PT V | 2018 / Vol. 11209
DOI
10.1007/978-3-030-01228-1_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by object region proposals from different frames in a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects, and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on the Charades and Something-Something datasets. Especially for Charades, whose videos feature complex environments, we obtain a huge 4.4% gain.
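
The reasoning step the abstract describes is compact enough to sketch. Below is a minimal, illustrative PyTorch implementation of one graph-convolution step over region-proposal features, where the similarity adjacency is computed from pairwise feature affinities so that correlated objects far apart in time can exchange information in a single step. The class and parameter names (SimilarityGCNLayer, phi, theta), the 512-d feature size, and the softmax normalization plus residual connection are assumptions for the sketch, not the authors' released code.

# A minimal sketch of GCN reasoning over region-proposal nodes.
# Shapes, names, and normalization are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGCNLayer(nn.Module):
    """One graph-convolution step: Z = relu(A_sim @ X @ W) + X."""

    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim, bias=False)     # embeds x for affinity
        self.theta = nn.Linear(dim, dim, bias=False)   # embeds x for affinity
        self.weight = nn.Linear(dim, dim, bias=False)  # GCN weight W

    def forward(self, x):
        # x: (N, dim) features for N object proposals pooled across frames,
        # e.g. RoI features on top of a video backbone
        affinity = self.phi(x) @ self.theta(x).t()     # (N, N) pairwise scores
        adj = F.softmax(affinity, dim=1)               # row-normalized adjacency
        return F.relu(self.weight(adj @ x)) + x        # propagate, then residual

regions = torch.randn(50, 512)    # e.g. 50 proposals across frames, 512-d each
layer = SimilarityGCNLayer(512)
out = layer(regions)              # (50, 512) updated node features

The paper additionally builds spatial-temporal relations from overlaps between region proposals in neighboring frames; in a sketch like this, that would amount to a second, hand-constructed adjacency fed through its own graph convolution, with the two outputs combined.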
Pages: 413-431
Number of pages: 19