Host-Parasite: Graph LSTM-in-LSTM for Group Activity Recognition

Citations: 183
Authors
Shu, Xiangbo [1 ]
Zhang, Liyan [2 ]
Sun, Yunlian [1 ]
Tang, Jinhui [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210016, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Activity recognition; Legged locomotion; Feature extraction; Solid modeling; Spatiotemporal phenomena; Computer architecture; Learning systems; Deep learning; graph LSTM (G-LSTM); group activity recognition; long short-term memory (LSTM); HISTOGRAMS;
DOI
10.1109/TNNLS.2020.2978942
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This article tackles the problem of group activity recognition in multiple-person scenes. To model a group activity with multiple persons, most long short-term memory (LSTM)-based methods first learn person-level action representations with several LSTMs and then integrate all of these representations into a subsequent LSTM to learn the group-level activity representation. Such a two-stage strategy neglects the "host-parasite" relationship between the group-level activity ("host") and the person-level actions ("parasite") in spatiotemporal space. To this end, we propose a novel graph LSTM-in-LSTM (GLIL) for group activity recognition that models the person-level actions and the group-level activity simultaneously. GLIL is a "host-parasite" architecture, which can be seen as several person LSTMs (P-LSTMs) in the local view or as a graph LSTM (G-LSTM) in the global view. Specifically, the P-LSTMs model the person-level actions based on the interactions among persons. Meanwhile, the G-LSTM models the group-level activity: the person-level motion information in the multiple P-LSTMs is selectively integrated and stored in the G-LSTM according to each person's contribution to the inference of the group activity class. Furthermore, to use person-level temporal features instead of person-level static features as the input of GLIL, we introduce a residual LSTM with a residual connection to learn person-level residual features that combine temporal and static features. Experimental results on two public data sets demonstrate the effectiveness of the proposed GLIL compared with state-of-the-art methods.
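The abstract implies a concrete data flow: a residual LSTM turns per-person static features into residual features, shared P-LSTMs update each person's state using interactions with the other persons, and a gating step decides how much of each person's motion is stored in the group-level G-LSTM. Below is a minimal PyTorch sketch of that flow; the class name GLILSketch, the mean-style graph message, the softmax contribution gate, and all dimensions are illustrative assumptions, not the authors' exact formulation.

# Minimal, illustrative sketch of the GLIL data flow described in the
# abstract (not the authors' exact formulation). Assumptions: a fixed
# number of persons N per clip, per-person CNN features as input, a
# mean-style graph message among persons, and a softmax gate scoring each
# person's contribution to the group-level G-LSTM state.
import torch
import torch.nn as nn


class GLILSketch(nn.Module):
    def __init__(self, feat_dim=512, hid=256, num_classes=8):
        super().__init__()
        # Residual LSTM: person-level temporal features are added back
        # onto the static features via a residual connection.
        self.res_lstm = nn.LSTMCell(feat_dim, feat_dim)
        # P-LSTM: one shared cell applied per person; its input is the
        # person's residual feature concatenated with a graph message
        # aggregated from the other persons' hidden states.
        self.p_lstm = nn.LSTMCell(feat_dim + hid, hid)
        # G-LSTM: consumes the gated sum of person-level states.
        self.g_lstm = nn.LSTMCell(hid, hid)
        self.gate = nn.Linear(hid, 1)        # contribution score per person
        self.classifier = nn.Linear(hid, num_classes)

    def forward(self, x):
        # x: (T, N, feat_dim) -- per-person static CNN features over time.
        T, N, D = x.shape
        hid = self.g_lstm.hidden_size
        h_r = c_r = x.new_zeros(N, D)        # residual-LSTM state
        h_p = c_p = x.new_zeros(N, hid)      # P-LSTM states (one per person)
        h_g = c_g = x.new_zeros(1, hid)      # G-LSTM state (group level)
        for t in range(T):
            # Residual connection: temporal features + static features.
            h_r, c_r = self.res_lstm(x[t], (h_r, c_r))
            res_feat = x[t] + h_r
            # Graph message: mean of the *other* persons' hidden states
            # (a simple stand-in for the paper's graph interactions).
            msg = (h_p.sum(0, keepdim=True) - h_p) / max(N - 1, 1)
            h_p, c_p = self.p_lstm(torch.cat([res_feat, msg], dim=1),
                                   (h_p, c_p))
            # Selective integration: gate each person's contribution
            # before it is stored in the group-level G-LSTM.
            w = torch.softmax(self.gate(h_p), dim=0)      # (N, 1)
            group_in = (w * h_p).sum(0, keepdim=True)     # (1, hid)
            h_g, c_g = self.g_lstm(group_in, (h_g, c_g))
        return self.classifier(h_g)          # group-activity logits


# Usage: a clip with T=10 frames and N=6 tracked persons.
logits = GLILSketch()(torch.randn(10, 6, 512))
print(logits.shape)  # torch.Size([1, 8])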
Pages: 663-674
Page count: 12