A Discriminative Deep Model With Feature Fusion and Temporal Attention for Human Action Recognition

Cited by: 33
Authors
Yu, Jiahui [1 ,2 ]
Gao, Hongwei [1 ]
Yang, Wei [1 ]
Jiang, Yueqiu [1 ]
Chin, Weihong [3 ]
Kubota, Naoyuki [3 ]
Ju, Zhaojie [2 ]
Affiliations
[1] Shenyang Ligong Univ, Sch Automat & Elect Engn, Shenyang 110159, Peoples R China
[2] Univ Portsmouth, Sch Comp, Portsmouth PO1 3HE, Hants, England
[3] Tokyo Metropolitan Univ, Grad Sch Syst Design, Tokyo 1910065, Japan
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Real-time systems; Spatiotemporal phenomena; Streaming media; Skeleton; Dynamics; Hidden Markov models; Human action recognition; RGB-D; attention mode; real-time feature fusion; dataset; TRACKING; SYSTEM;
DOI
10.1109/ACCESS.2020.2977856
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Activity recognition, which aims to accurately distinguish human actions in complex environments, plays a key role in human-robot/computer interaction. However, long-lasting and similar actions cause poor feature-sequence extraction and thus reduce recognition accuracy. We propose a novel discriminative deep model (D3D-LSTM) based on 3D-CNN and LSTM for both single-target and interaction action recognition, improving spatiotemporal processing performance. Our model has several notable properties: 1) a real-time feature-fusion method obtains a more representative feature sequence through composition of local mixtures, enhancing the discrimination of similar actions; 2) an improved attention mechanism focuses on each frame individually by assigning different weights in real time; 3) an alternating optimization strategy is proposed to obtain the best-performing model parameters. Because the proposed D3D-LSTM model is efficient enough to serve as a detector that recognizes various activities, a Real-set database is collected to evaluate action recognition in complex real-world scenarios. For long-term relations, the present memory state is updated via the weight-controlled attention module, which enables the memory cell to store better long-term features. The densely connected bimodal model makes the local perceptrons of the 3D-Conv motion-aware and stores better short-term features. The proposed D3D-LSTM model has been evaluated through a series of experiments on the Real-set and open-source datasets, i.e., SBU-Kinect and MSR-action-3D. Experimental results show that the model achieves new state-of-the-art results, pushing the average recognition rate on SBU-Kinect to 92.40% and on MSR-action-3D to 95.40%.
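The abstract only sketches the architecture at a high level. The following is a minimal, illustrative PyTorch sketch of the general idea (3D-Conv features, a per-frame temporal attention weighting, then an LSTM classifier); it is not the authors' implementation, and the layer sizes, class names (TemporalAttention, D3DLSTMSketch), and input shapes are assumptions made for the example.

# Minimal sketch (not the paper's code): 3D-CNN features -> per-frame
# attention weights -> LSTM -> classifier, for clips of shape
# (batch, channels, frames, height, width).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Assigns a scalar weight to each frame-level feature vector."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                               # x: (batch, frames, feat_dim)
        weights = torch.softmax(self.score(x), dim=1)   # (batch, frames, 1)
        return x * weights, weights                     # re-weighted per-frame features

class D3DLSTMSketch(nn.Module):
    """Toy stand-in for the D3D-LSTM idea, with assumed layer sizes."""
    def __init__(self, num_classes=10, feat_dim=64, hidden=128):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),          # pool space, keep temporal axis
        )
        self.attn = TemporalAttention(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                             # clip: (batch, 3, frames, H, W)
        feats = self.conv3d(clip)                        # (batch, feat_dim, frames, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, frames, feat_dim)
        feats, _ = self.attn(feats)                      # weight each frame individually
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                       # classify from last hidden state

# Usage example on a random 16-frame RGB clip
logits = D3DLSTMSketch()(torch.randn(2, 3, 16, 32, 32))

The real-time feature fusion across modalities, the weight-controlled memory update inside the LSTM cell, and the alternating optimization described in the abstract are not reproduced here; this sketch only shows where a per-frame attention module sits between the 3D-Conv backbone and the recurrent classifier.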
Pages: 43243-43255
Number of pages: 13