A Deep Sequence Learning Framework for Action Recognition in Small-Scale Depth Video Dataset

Cited by: 2
Authors
Bulbul, Mohammad Farhad [1 ,2 ]
Ullah, Amin [3 ]
Ali, Hazrat [4 ]
Kim, Daijin [1 ]
Affiliations
[1] Pohang Univ Sci & Technol POSTECH, Dept Comp Sci & Engn, 77 Cheongam, Pohang 37673, South Korea
[2] Jashore Univ Sci & Technol, Dept Math, Jashore 7408, Bangladesh
[3] Oregon State Univ, CORIS Inst, Corvallis, OR 97331 USA
[4] Hamad Bin Khalifa Univ, Qatar Fdn, Coll Sci & Engn, POB 34110, Doha, Qatar
Keywords
3D action recognition; depth map sequence; CNN; transfer learning; bi-directional LSTM; RNN; attention; BIDIRECTIONAL LSTM; FUSION; IMAGE; 2D
DOI
10.3390/s22186841
CLC Number
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Deep models that recognize human actions from depth video sequences are scarce compared to those based on RGB and skeleton sequences. This scarcity limits research progress on depth data, as training deep models on small-scale data is challenging. In this work, we propose a deep sequence classification model for depth video data in scenarios where video data are limited. Rather than classifying each frame individually, our method directly classifies a depth video, i.e., a sequence of depth frames. First, the proposed system transforms an input depth video into three sequences of multi-view temporal motion frames. Together with the input depth frame sequence, these three temporal motion sequences yield a four-stream representation of the input depth action video. Next, a DenseNet121 architecture with ImageNet pre-trained weights extracts discriminative frame-level action features from the depth and temporal motion frames. The four resulting sets of frame-level feature vectors, one per stream, are fed into four bi-directional LSTM (BiLSTM) networks. The temporal features are then refined with multi-head self-attention (MHSA) to capture multi-view sequence correlations. Finally, the concatenation of the four stream outputs is passed through dense layers to classify the input depth video. Experimental results on two small-scale benchmark depth datasets, MSRAction3D and DHA, demonstrate that the proposed framework is effective even with insufficient training samples and outperforms existing depth-based action recognition methods.
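A minimal sketch of the architecture described in the abstract, assuming PyTorch and torchvision (neither is named in the record). The class name, hidden size, head count, class count, and the per-stream placement of the MHSA block are illustrative assumptions, not the paper's reported configuration; depth and motion frames are assumed to be replicated to three channels to fit the ImageNet-pretrained backbone.

import torch
import torch.nn as nn
from torchvision import models

class FourStreamDepthActionNet(nn.Module):  # hypothetical name
    def __init__(self, num_classes=20, hidden=256, heads=4):  # illustrative sizes
        super().__init__()
        # Frame-level feature extractor: ImageNet-pretrained DenseNet121 with its
        # classifier removed, yielding 1024-d globally pooled features per frame.
        backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        backbone.classifier = nn.Identity()
        self.cnn = backbone
        # One BiLSTM per stream: raw depth frames + three multi-view motion streams.
        self.blstms = nn.ModuleList(
            nn.LSTM(1024, hidden, batch_first=True, bidirectional=True)
            for _ in range(4)
        )
        # Multi-head self-attention over each stream's temporal features
        # (an assumption; the paper may instead attend over the fused streams).
        self.mhsa = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Dense layers classify the concatenated four-stream representation.
        self.head = nn.Sequential(
            nn.Linear(4 * 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, streams):
        # streams: list of 4 tensors, each shaped (batch, time, 3, 224, 224).
        pooled = []
        for x, blstm in zip(streams, self.blstms):
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # (b, t, 1024)
            seq, _ = blstm(feats)                             # (b, t, 2*hidden)
            seq, _ = self.mhsa(seq, seq, seq)                 # temporal self-attention
            pooled.append(seq.mean(dim=1))                    # average over time
        return self.head(torch.cat(pooled, dim=1))            # class logits

# Toy usage: two clips of 16 frames per stream, 20 classes (as in MSRAction3D).
streams = [torch.randn(2, 16, 3, 224, 224) for _ in range(4)]
logits = FourStreamDepthActionNet()(streams)  # -> shape (2, 20)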
Pages: 22