Zoom Transformer for Skeleton-Based Group Activity Recognition

被引:46
作者
Zhang, Jiaxu [1 ]
Jia, Yifan [2 ]
Xie, Wei [3 ]
Tu, Zhigang [1 ]
机构
[1] Wuhan Univ, State Key Lab Informat Engn Surveying Mapping & Re, Wuhan 430079, Peoples R China
[2] Wuhan Univ, Renmin Hosp, Dept Pain, Wuhan 430060, Peoples R China
[3] Cent China Normal Univ, Sch Comp, Wuhan 430079, Peoples R China
基金
中国国家自然科学基金;
关键词
Skeleton; Transformers; Feature extraction; Activity recognition; Data mining; Visualization; Task analysis; skeleton-based action; visual transformer; attention mechanism;
D O I
10.1109/TCSVT.2022.3193574
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Skeleton-based human action recognition has attracted increasing attention and many methods have been proposed to boost the performance. However, these methods still confront three main limitations: 1) Focusing on single-person action recognition while neglecting the group activity of multiple people (more than 5 people). In practice, multi-person group activity recognition via skeleton data is also a meaningful problem. 2) Unable to mine high-level semantic information from the skeleton data, such as interactions among multiple people and their positional relationships. 3) Existing datasets used for multi-person group activity recognition are all RGB videos involved, which cannot be directly applied to skeleton-based group activity analysis. To address these issues, we propose a novel Zoom Transformer to exploit both the low-level single-person motion information and the high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps. Besides, we estimate the multi-person skeletons from the existing real-world video datasets i.e. Kinetics and Volleyball-Activity, and release two new benchmarks to verify the effectiveness of our Zoom Transfromer. Extensive experiments demonstrate that our model can effectively cope with the skeleton-based multi-person group activity. Additionally, experiments on the large-scale NTU-RGB+D dataset validate that our model also achieves remarkable performance for single-person action recognition. The code and the skeleton data are publicly available at https://github.com/Kebii/Zoom-Transformer
引用
收藏
页码:8646 / 8659
页数:14
相关论文
共 67 条
[1]  
Amer MR, 2014, LECT NOTES COMPUT SC, V8694, P572, DOI 10.1007/978-3-319-10599-4_37
[2]   Convolutional Relational Machine for Group Activity Recognition [J].
Azar, Sina Mokhtarzadeh ;
Atigh, Mina Ghadimi ;
Nickabadi, Ahmad ;
Alahi, Alexandre .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :7884-7893
[3]   Fuzzy Integral-Based CNN Classifier Fusion for 3D Skeleton Action Recognition [J].
Banerjee, Avinandan ;
Singh, Pawan Kumar ;
Sarkar, Ram .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (06) :2206-2216
[4]   SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition [J].
Caetano, Carlos ;
Sena, Jessica ;
Bremond, Francois ;
dos Santos, Jefersson A. ;
Schwartz, William Robson .
2019 16TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2019,
[5]   Skeleton-Based Action Recognition With Gated Convolutional Neural Networks [J].
Cao, Congqi ;
Lan, Cuiling ;
Zhang, Yifan ;
Zeng, Wenjun ;
Lu, Hanqing ;
Zhang, Yanning .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (11) :3247-3257
[6]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[7]   Clustering Driven Deep Autoencoder for Video Anomaly Detection [J].
Chang, Yunpeng ;
Tu, Zhigang ;
Xie, Wei ;
Yuan, Junsong .
COMPUTER VISION - ECCV 2020, PT XV, 2020, 12360 :329-345
[8]   Pre-Trained Image Processing Transformer [J].
Chen, Hanting ;
Wang, Yunhe ;
Guo, Tianyu ;
Xu, Chang ;
Deng, Yiping ;
Liu, Zhenhua ;
Ma, Siwei ;
Xu, Chunjing ;
Xu, Chao ;
Gao, Wen .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :12294-12305
[9]   3D Human Motion Reconstruction in Unity with Monocular Camera [J].
Chen, Tai-Wei ;
Lin, Wei-Liang .
2020 17TH INTERNATIONAL SOC DESIGN CONFERENCE (ISOCC 2020), 2020, :191-192
[10]   Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition [J].
Cheng, Ke ;
Zhang, Yifan ;
Cao, Congqi ;
Shi, Lei ;
Cheng, Jian ;
Lu, Hanqing .
COMPUTER VISION - ECCV 2020, PT XXIV, 2020, 12369 :536-553