A Deep Sequence Learning Framework for Action Recognition in Small-Scale Depth Video Dataset

Cited by: 2
Authors
Bulbul, Mohammad Farhad [1 ,2 ]
Ullah, Amin [3 ]
Ali, Hazrat [4 ]
Kim, Daijin [1 ]
Affiliations
[1] Pohang Univ Sci & Technol POSTECH, Dept Comp Sci & Engn, 77 Cheongam, Pohang 37673, South Korea
[2] Jashore Univ Sci & Technol, Dept Math, Jashore 7408, Bangladesh
[3] Oregon State Univ, CORIS Inst, Corvallis, OR 97331 USA
[4] Hamad Bin Khalifa Univ, Qatar Fdn, Coll Sci & Engn, POB 34110, Doha, Qatar
Keywords
3D action recognition; depth map sequence; CNN; transfer learning; bi-directional LSTM; RNN; attention; BIDIRECTIONAL LSTM; FUSION; IMAGE; 2D
DOI
10.3390/s22186841
CLC Number
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Deep models that recognize human actions from depth video sequences are scarce compared to those based on RGB and skeleton sequences. This scarcity limits research progress on depth data, as training deep models on small-scale data is challenging. In this work, we propose a deep sequence classification model for depth video data in scenarios where video data are limited. Rather than classifying each frame individually, our method directly classifies a depth video, i.e., a sequence of depth frames. First, the proposed system transforms an input depth video into three sequences of multi-view temporal motion frames. Together with the input depth frame sequence, these three temporal motion sequences yield a four-stream representation of the input depth action video. Next, a DenseNet121 architecture with ImageNet pre-trained weights extracts discriminative frame-level action features from the depth and temporal motion frames. The four resulting sets of frame-level feature vectors, one per stream, are fed into four bi-directional LSTM (BiLSTM) networks. The temporal features are then refined with multi-head self-attention (MHSA) to capture multi-view sequence correlations. Finally, the concatenation of the four stream outputs is passed through dense layers to classify the input depth video. Experimental results on two small-scale benchmark depth datasets, MSRAction3D and DHA, demonstrate that the proposed framework is effective even with insufficient training samples and outperforms existing depth-based action recognition methods.
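A minimal sketch of the architecture described in the abstract, assuming PyTorch and torchvision (neither is named in the record). The class name, hidden size, head count, class count, and the per-stream placement of the MHSA block are illustrative assumptions, not the paper's reported configuration; depth and motion frames are assumed to be replicated to three channels to fit the ImageNet-pretrained backbone.

import torch
import torch.nn as nn
from torchvision import models

class FourStreamDepthActionNet(nn.Module):  # hypothetical name
    def __init__(self, num_classes=20, hidden=256, heads=4):  # illustrative sizes
        super().__init__()
        # Frame-level feature extractor: ImageNet-pretrained DenseNet121 with its
        # classifier removed, yielding 1024-d globally pooled features per frame.
        backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        backbone.classifier = nn.Identity()
        self.cnn = backbone
        # One BiLSTM per stream: raw depth frames + three multi-view motion streams.
        self.blstms = nn.ModuleList(
            nn.LSTM(1024, hidden, batch_first=True, bidirectional=True)
            for _ in range(4)
        )
        # Multi-head self-attention over each stream's temporal features
        # (an assumption; the paper may instead attend over the fused streams).
        self.mhsa = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Dense layers classify the concatenated four-stream representation.
        self.head = nn.Sequential(
            nn.Linear(4 * 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, streams):
        # streams: list of 4 tensors, each shaped (batch, time, 3, 224, 224).
        pooled = []
        for x, blstm in zip(streams, self.blstms):
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # (b, t, 1024)
            seq, _ = blstm(feats)                             # (b, t, 2*hidden)
            seq, _ = self.mhsa(seq, seq, seq)                 # temporal self-attention
            pooled.append(seq.mean(dim=1))                    # average over time
        return self.head(torch.cat(pooled, dim=1))            # class logits

# Toy usage: two clips of 16 frames per stream, 20 classes (as in MSRAction3D).
streams = [torch.randn(2, 16, 3, 224, 224) for _ in range(4)]
logits = FourStreamDepthActionNet()(streams)  # -> shape (2, 20)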
Pages: 22