Deep Neural Networks Using Capsule Networks and Skeleton-Based Attentions for Action Recognition

Cited by: 20
Authors
Ha, Manh-Hung [1 ]
Chen, Oscal Tzyh-Chiang [1 ]
Affiliation
[1] Natl Chung Cheng Univ, Dept Elect Engn, Chiayi 62102, Taiwan
Keywords
Deep neural network; convolutional neural network; recurrent neural network; capsule network; spatiotemporal attention; skeleton; action recognition
DOI
10.1109/ACCESS.2020.3048741
CLC classification: TP [Automation technology; computer technology]
Subject classification code: 0812
Abstract
This work develops Deep Neural Networks (DNNs) that adopt Capsule Networks (CapsNets) and spatiotemporal skeleton-based attention to effectively recognize subject actions from the rich spatial and temporal contexts of videos. The proposed generic DNN comprises four 3D Convolutional Neural Networks (3D_CNNs), Attention-Jointed Appearance (AJA) and Attention-Jointed Motion (AJM) generation layers, two Reduction Layers (RLs), two Attention-based Recurrent Neural Networks (A_RNNs), and an inference classifier, with RGB, transformed-skeleton, and optical-flow streams as inputs. The AJA and AJM generation layers emphasize skeletons in the appearances and motions of a subject, respectively. The A_RNNs generate attention weights over time steps to highlight rich temporal contexts. To integrate CapsNets into this generic DNN, three types of CapsNet-based DNNs are devised, in which a CapsNet takes over the classifier, the A_RNN+classifier, or the RL+A_RNN+classifier. The experimental results reveal that the proposed DNN using a CapsNet as the inference classifier outperforms the other two CapsNet-based DNNs and the generic DNN that adopts a feedforward neural network as the inference classifier. Additionally, to the best of our knowledge, our best CapsNet-based DNN achieves average accuracies of 98.5% on UCF101 (state-of-the-art), 82.1% on HMDB51 (near state-of-the-art), and 95.3% on panoramic videos. In particular, we find that the generic CapsNet behaves as an outstanding inference classifier but is slightly worse than the A_RNN at interpreting temporal evidence for recognition. Therefore, the proposed DNN, which employs a CapsNet as its inference classifier, can be effectively applied to various context-aware visual applications.
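The abstract's central design choice is using a CapsNet as the inference classifier, where each output capsule's vector length encodes the presence probability of one action class. The paper does not give implementation details here, so the following is only a minimal numpy sketch of the two standard CapsNet operations (the squash nonlinearity and dynamic routing by agreement, per Sabour et al., 2017); the capsule counts, dimensions, and input data are illustrative placeholders, not values from the paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # CapsNet squash nonlinearity: scales short vectors toward 0 and
    # long vectors toward (but below) unit length, so vector length
    # can be read as a class-presence probability.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: (n_in, n_out, d_out) prediction vectors from lower-level
    # capsules for each output (class) capsule.
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits
    for _ in range(n_iters):
        # Softmax over output capsules: coupling coefficients.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum -> (n_out, d_out)
        v = squash(s)                            # output capsule vectors
        # Increase logits where predictions agree with the output.
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

# Toy example: 8 input capsules routing to 5 action classes, 16-D poses.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 5, 16))
v = dynamic_routing(u_hat)
probs = np.linalg.norm(v, axis=-1)  # per-class "probabilities" in [0, 1)
print(probs.shape)  # (5,)
```

The squash function guarantees every output capsule's length lies in [0, 1), which is what lets the classifier read lengths directly as class scores without an extra softmax head.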
Pages: 6164-6178 (15 pages)