A Discriminative Deep Model With Feature Fusion and Temporal Attention for Human Action Recognition

Cited by: 33
Authors
Yu, Jiahui [1 ,2 ]
Gao, Hongwei [1 ]
Yang, Wei [1 ]
Jiang, Yueqiu [1 ]
Chin, Weihong [3 ]
Kubota, Naoyuki [3 ]
Ju, Zhaojie [2 ]
Affiliations
[1] Shenyang Ligong Univ, Sch Automat & Elect Engn, Shenyang 110159, Peoples R China
[2] Univ Portsmouth, Sch Comp, Portsmouth PO1 3HE, Hants, England
[3] Tokyo Metropolitan Univ, Grad Sch Syst Design, Tokyo 1910065, Japan
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Real-time systems; Spatiotemporal phenomena; Streaming media; Skeleton; Dynamics; Hidden Markov models; Human action recognition; RGB-D; attention mode; real-time feature fusion; dataset; TRACKING; SYSTEM;
DOI
10.1109/ACCESS.2020.2977856
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Activity recognition, which aims to accurately distinguish human actions in complex environments, plays a key role in human-robot/computer interaction. However, long-lasting and similar actions cause poor feature-sequence extraction and thus reduce recognition accuracy. We propose a novel discriminative deep model (D3D-LSTM) based on 3D-CNN and LSTM for both single-target and interaction action recognition, improving spatiotemporal processing performance. Our model has several notable properties: 1) a real-time feature-fusion method obtains a more representative feature sequence through composition of local mixtures, enhancing the discrimination of similar actions; 2) an improved attention mechanism focuses on each frame individually by assigning different weights in real time; 3) an alternating optimization strategy is proposed to obtain the best-performing model parameters. Because the proposed D3D-LSTM model is efficient enough to serve as a detector that recognizes various activities, a Real-set database is collected to evaluate action recognition in complex real-world scenarios. For long-term relations, the present memory state is updated via the weight-controlled attention module, which enables the memory cell to store better long-term features. The densely connected bimodal model makes the local perceptrons of the 3D-Conv motion-aware and stores better short-term features. The proposed D3D-LSTM model has been evaluated through a series of experiments on the Real-set and open-source datasets, i.e., SBU-Kinect and MSR-action-3D. Experimental results show that the model achieves new state-of-the-art results, pushing the average recognition rate on SBU-Kinect to 92.40% and on MSR-action-3D to 95.40%.
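The abstract only sketches the architecture at a high level. The following is a minimal, illustrative PyTorch sketch of the general idea (3D-Conv features, a per-frame temporal attention weighting, then an LSTM classifier); it is not the authors' implementation, and the layer sizes, class names (TemporalAttention, D3DLSTMSketch), and input shapes are assumptions made for the example.

# Minimal sketch (not the paper's code): 3D-CNN features -> per-frame
# attention weights -> LSTM -> classifier, for clips of shape
# (batch, channels, frames, height, width).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Assigns a scalar weight to each frame-level feature vector."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                               # x: (batch, frames, feat_dim)
        weights = torch.softmax(self.score(x), dim=1)   # (batch, frames, 1)
        return x * weights, weights                     # re-weighted per-frame features

class D3DLSTMSketch(nn.Module):
    """Toy stand-in for the D3D-LSTM idea, with assumed layer sizes."""
    def __init__(self, num_classes=10, feat_dim=64, hidden=128):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),          # pool space, keep temporal axis
        )
        self.attn = TemporalAttention(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                             # clip: (batch, 3, frames, H, W)
        feats = self.conv3d(clip)                        # (batch, feat_dim, frames, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, frames, feat_dim)
        feats, _ = self.attn(feats)                      # weight each frame individually
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                       # classify from last hidden state

# Usage example on a random 16-frame RGB clip
logits = D3DLSTMSketch()(torch.randn(2, 3, 16, 32, 32))

The real-time feature fusion across modalities, the weight-controlled memory update inside the LSTM cell, and the alternating optimization described in the abstract are not reproduced here; this sketch only shows where a per-frame attention module sits between the 3D-Conv backbone and the recurrent classifier.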
Pages: 43243-43255
Number of pages: 13