Host-Parasite: Graph LSTM-in-LSTM for Group Activity Recognition

Citations: 183
Authors
Shu, Xiangbo [1 ]
Zhang, Liyan [2 ]
Sun, Yunlian [1 ]
Tang, Jinhui [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210016, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Activity recognition; Legged locomotion; Feature extraction; Solid modeling; Spatiotemporal phenomena; Computer architecture; Learning systems; Deep learning; graph LSTM (G-LSTM); group activity recognition; long short-term memory (LSTM); HISTOGRAMS;
DOI
10.1109/TNNLS.2020.2978942
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This article tackles the problem of group activity recognition in multiple-person scenes. To model a group activity with multiple persons, most long short-term memory (LSTM)-based methods first learn person-level action representations with several LSTMs and then integrate all of these representations into a subsequent LSTM to learn the group-level activity representation. Such a two-stage strategy neglects the "host-parasite" relationship between the group-level activity ("host") and the person-level actions ("parasite") in spatiotemporal space. To this end, we propose a novel graph LSTM-in-LSTM (GLIL) for group activity recognition that models the person-level actions and the group-level activity simultaneously. GLIL is a "host-parasite" architecture, which can be seen as several person LSTMs (P-LSTMs) in the local view or as a graph LSTM (G-LSTM) in the global view. Specifically, the P-LSTMs model the person-level actions based on the interactions among persons. Meanwhile, the G-LSTM models the group-level activity: the person-level motion information in the multiple P-LSTMs is selectively integrated and stored in the G-LSTM according to each person's contribution to the inference of the group activity class. Furthermore, to use person-level temporal features instead of person-level static features as the input of GLIL, we introduce a residual LSTM with a residual connection to learn person-level residual features that combine temporal and static features. Experimental results on two public data sets demonstrate the effectiveness of the proposed GLIL compared with state-of-the-art methods.
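The abstract implies a concrete data flow: a residual LSTM turns per-person static features into residual features, shared P-LSTMs update each person's state using interactions with the other persons, and a gating step decides how much of each person's motion is stored in the group-level G-LSTM. Below is a minimal PyTorch sketch of that flow; the class name GLILSketch, the mean-style graph message, the softmax contribution gate, and all dimensions are illustrative assumptions, not the authors' exact formulation.

# Minimal, illustrative sketch of the GLIL data flow described in the
# abstract (not the authors' exact formulation). Assumptions: a fixed
# number of persons N per clip, per-person CNN features as input, a
# mean-style graph message among persons, and a softmax gate scoring each
# person's contribution to the group-level G-LSTM state.
import torch
import torch.nn as nn


class GLILSketch(nn.Module):
    def __init__(self, feat_dim=512, hid=256, num_classes=8):
        super().__init__()
        # Residual LSTM: person-level temporal features are added back
        # onto the static features via a residual connection.
        self.res_lstm = nn.LSTMCell(feat_dim, feat_dim)
        # P-LSTM: one shared cell applied per person; its input is the
        # person's residual feature concatenated with a graph message
        # aggregated from the other persons' hidden states.
        self.p_lstm = nn.LSTMCell(feat_dim + hid, hid)
        # G-LSTM: consumes the gated sum of person-level states.
        self.g_lstm = nn.LSTMCell(hid, hid)
        self.gate = nn.Linear(hid, 1)        # contribution score per person
        self.classifier = nn.Linear(hid, num_classes)

    def forward(self, x):
        # x: (T, N, feat_dim) -- per-person static CNN features over time.
        T, N, D = x.shape
        hid = self.g_lstm.hidden_size
        h_r = c_r = x.new_zeros(N, D)        # residual-LSTM state
        h_p = c_p = x.new_zeros(N, hid)      # P-LSTM states (one per person)
        h_g = c_g = x.new_zeros(1, hid)      # G-LSTM state (group level)
        for t in range(T):
            # Residual connection: temporal features + static features.
            h_r, c_r = self.res_lstm(x[t], (h_r, c_r))
            res_feat = x[t] + h_r
            # Graph message: mean of the *other* persons' hidden states
            # (a simple stand-in for the paper's graph interactions).
            msg = (h_p.sum(0, keepdim=True) - h_p) / max(N - 1, 1)
            h_p, c_p = self.p_lstm(torch.cat([res_feat, msg], dim=1),
                                   (h_p, c_p))
            # Selective integration: gate each person's contribution
            # before it is stored in the group-level G-LSTM.
            w = torch.softmax(self.gate(h_p), dim=0)      # (N, 1)
            group_in = (w * h_p).sum(0, keepdim=True)     # (1, hid)
            h_g, c_g = self.g_lstm(group_in, (h_g, c_g))
        return self.classifier(h_g)          # group-activity logits


# Usage: a clip with T=10 frames and N=6 tracked persons.
logits = GLILSketch()(torch.randn(10, 6, 512))
print(logits.shape)  # torch.Size([1, 8])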
Pages: 663-674
Page count: 12