Human Action Recognition From Various Data Modalities: A Review

Cited by: 384
Authors
Sun, Zehua [1]
Ke, Qiuhong [2]
Rahmani, Hossein [3]
Bennamoun, Mohammed [4]
Wang, Gang [5]
Liu, Jun [1]
Affiliations
[1] Singapore Univ Technol & Design, Singapore 487372, Singapore
[2] Monash Univ, Clayton, Vic 3800, Australia
[3] Univ Lancaster, Lancaster LA1 4YW, England
[4] Univ Western Australia, Crawley, WA 6009, Australia
[5] Alibaba Grp, Hangzhou 310052, Zhejiang, Peoples R China
Funding
National Research Foundation, Singapore
Keywords
Feature extraction; Visualization; Skeleton; Optical imaging; Deep learning; Three-dimensional displays; Radar; Human action recognition; deep learning; data modality; single modality; multi-modality; CONVOLUTIONAL NEURAL-NETWORKS; FREE WIRELESS LOCALIZATION; RGB-D; BIDIRECTIONAL LSTM; ACCELEROMETER DATA; MOTION; DEPTH; CNN; ENSEMBLE; FUSION;
DOI
10.1109/TPAMI.2022.3183112
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications and has therefore been attracting increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signals, which encode different sources of useful yet distinct information and offer different advantages depending on the application scenario. Consequently, many existing works have investigated different approaches to HAR using various modalities. In this article, we present a comprehensive survey of recent progress in deep learning methods for HAR, organized by the type of input data modality. Specifically, we review the current mainstream deep learning methods for single and multiple data modalities, including fusion-based and co-learning-based frameworks. We also present comparative results on several benchmark HAR datasets, together with insightful observations and inspiring future research directions.
Pages: 3200-3225 (26 pages)