Deep Neural Networks Using Capsule Networks and Skeleton-Based Attentions for Action Recognition

Cited by: 18
Authors
Ha, Manh-Hung [1 ]
Chen, Oscal Tzyh-Chiang [1 ]
Affiliations
[1] Natl Chung Cheng Univ, Dept Elect Engn, Chiayi 62102, Taiwan
Keywords
Deep neural network; convolutional neural network; recurrent neural network; capsule network; spatiotemporal attention; skeleton; action recognition
DOI
10.1109/ACCESS.2020.3048741
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
This work develops Deep Neural Networks (DNNs) that adopt Capsule Networks (CapsNets) and spatiotemporal skeleton-based attention to effectively recognize subject actions from the rich spatial and temporal contexts of videos. The proposed generic DNN comprises four 3D Convolutional Neural Networks (3D_CNNs), Attention-Jointed Appearance (AJA) and Attention-Jointed Motion (AJM) generation layers, two Reduction Layers (RLs), two Attention-based Recurrent Neural Networks (A_RNNs), and an inference classifier, with RGB, transformed-skeleton, and optical-flow channel streams as inputs. The AJA and AJM generation layers use skeletons to emphasize the appearance and motion of a subject, respectively, while the A_RNNs generate attention weights over time steps to highlight rich temporal contexts. To integrate CapsNets into this generic DNN, three types of CapsNet-based DNNs are devised, in which a CapsNet replaces the classifier, the A_RNN plus classifier, or the RL plus A_RNN plus classifier. The experimental results reveal that the proposed DNN using a CapsNet as the inference classifier outperforms the other two CapsNet-based DNNs as well as the generic DNN that adopts a feedforward neural network as its inference classifier. Additionally, to the best of our knowledge, our best CapsNet-based DNN achieves average accuracies of 98.5% on UCF101 (state-of-the-art performance), 82.1% on HMDB51 (near-state-of-the-art performance), and 95.3% on panoramic videos. In particular, the generic CapsNet proves to be an outstanding inference classifier but is slightly worse than the A_RNN at interpreting temporal evidence for recognition. Therefore, the proposed DNN, which employs a CapsNet as its inference classifier, is well suited to various context-aware visual applications.
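The abstract describes two attention mechanisms: skeleton heatmaps emphasizing subject regions in the appearance features (the AJA layer) and attention weights over time steps (the A_RNNs). The paper's exact formulation is not given in this record, so the following NumPy sketch only illustrates the general idea; all shapes, the element-wise modulation, and the softmax-based temporal weighting are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T time steps, an H x W spatial grid, C channels.
T, H, W, C = 8, 7, 7, 16

appearance = rng.random((T, H, W, C))   # features from the RGB stream
skeleton = rng.random((T, H, W, 1))     # skeleton heatmaps (attention source)

# Attention-Jointed Appearance (AJA), sketched as element-wise modulation:
# the skeleton heatmap emphasizes subject regions in the appearance features.
aja = appearance * skeleton             # broadcasts over the channel axis

# Temporal attention, sketched as a softmax over per-frame feature summaries
# so that time steps with richer responses receive larger weights.
frame_summary = aja.mean(axis=(1, 2, 3))               # one scalar per frame
weights = np.exp(frame_summary) / np.exp(frame_summary).sum()
pooled = weights @ aja.reshape(T, -1)                  # attention-weighted pool

print(weights.sum())   # softmax weights sum to 1 (up to float rounding)
print(pooled.shape)    # (784,) = H * W * C
```

An analogous modulation of the optical-flow stream by the skeleton heatmaps would sketch the AJM layer; in the paper these operations sit between the 3D_CNNs and the downstream RLs, A_RNNs, and classifier.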
Pages: 6164-6178
Page count: 15