Deep Neural Networks Using Capsule Networks and Skeleton-Based Attentions for Action Recognition

Cited by: 20
Authors
Ha, Manh-Hung [1 ]
Chen, Oscal Tzyh-Chiang [1 ]
Affiliation
[1] Natl Chung Cheng Univ, Dept Elect Engn, Chiayi 62102, Taiwan
Keywords
Deep neural network; convolutional neural network; recurrent neural network; capsule network; spatiotemporal attention; skeleton; action recognition
DOI
10.1109/ACCESS.2020.3048741
CLC classification: TP [Automation technology; computer technology]
Subject classification code: 0812
Abstract
This work develops Deep Neural Networks (DNNs) that adopt Capsule Networks (CapsNets) and spatiotemporal skeleton-based attention to effectively recognize subject actions from the rich spatial and temporal contexts of videos. The proposed generic DNN comprises four 3D Convolutional Neural Networks (3D_CNNs), Attention-Jointed Appearance (AJA) and Attention-Jointed Motion (AJM) generation layers, two Reduction Layers (RLs), two Attention-based Recurrent Neural Networks (A_RNNs), and an inference classifier, with RGB, transformed-skeleton, and optical-flow streams as inputs. The AJA and AJM generation layers emphasize skeletons in the appearances and motions of a subject, respectively. The A_RNNs generate attention weights over time steps to highlight rich temporal contexts. To integrate CapsNets into this generic DNN, three types of CapsNet-based DNNs are devised, in which a CapsNet takes over the classifier, the A_RNN+classifier, or the RL+A_RNN+classifier. The experimental results reveal that the proposed DNN using a CapsNet as the inference classifier outperforms the other two CapsNet-based DNNs and the generic DNN that adopts a feedforward neural network as the inference classifier. Additionally, to the best of our knowledge, our best CapsNet-based DNN achieves average accuracies of 98.5% on UCF101 (state-of-the-art), 82.1% on HMDB51 (near state-of-the-art), and 95.3% on panoramic videos. In particular, we find that the generic CapsNet behaves as an outstanding inference classifier but is slightly worse than the A_RNN at interpreting temporal evidence for recognition. Therefore, the proposed DNN, which employs a CapsNet as its inference classifier, can be effectively applied to various context-aware visual applications.
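The abstract's central design choice is using a CapsNet as the inference classifier, where each output capsule's vector length encodes the presence probability of one action class. The paper does not give implementation details here, so the following is only a minimal numpy sketch of the two standard CapsNet operations (the squash nonlinearity and dynamic routing by agreement, per Sabour et al., 2017); the capsule counts, dimensions, and input data are illustrative placeholders, not values from the paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # CapsNet squash nonlinearity: scales short vectors toward 0 and
    # long vectors toward (but below) unit length, so vector length
    # can be read as a class-presence probability.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: (n_in, n_out, d_out) prediction vectors from lower-level
    # capsules for each output (class) capsule.
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits
    for _ in range(n_iters):
        # Softmax over output capsules: coupling coefficients.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum -> (n_out, d_out)
        v = squash(s)                            # output capsule vectors
        # Increase logits where predictions agree with the output.
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

# Toy example: 8 input capsules routing to 5 action classes, 16-D poses.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 5, 16))
v = dynamic_routing(u_hat)
probs = np.linalg.norm(v, axis=-1)  # per-class "probabilities" in [0, 1)
print(probs.shape)  # (5,)
```

The squash function guarantees every output capsule's length lies in [0, 1), which is what lets the classifier read lengths directly as class scores without an extra softmax head.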
Pages: 6164-6178 (15 pages)