Multi-stream 3D CNN structure for human action recognition trained by limited data

Cited by: 26
Authors
Chenarlogh, Vahid Ashkani [1 ]
Razzazi, Farbod [1 ]
Affiliations
[1] Islamic Azad Univ, Sci & Res Branch, Dept Elect & Comp Engn, Tehran, Iran
Keywords
object recognition; image motion analysis; image classification; cameras; feature extraction; learning (artificial intelligence); video signal processing; image sequences; convolutional neural nets; multistream 3D CNN structure; human action recognition; training performance; training data case; optical flows; vertical directions; three-dimensional CNNs; four-stream 3D CNNs; single-stream model; two-stream architecture; four-stream architecture; information channels; separate streams; action recognition system; data set; four-stream structure; convolutional neural network architectures; optical flow; recognition rate; IXMAS; FEATURES;
DOI
10.1049/iet-cvi.2018.5088
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Here, the authors proposed a solution to improve training performance for human action recognition when training data are limited, introducing three different convolutional neural network (CNN) architectures for this purpose. First, four information channels were generated from each frame, namely optical flows and gradients in the horizontal and vertical directions, to be applied to three-dimensional (3D) CNNs. Three architectures were then proposed: single-stream, two-stream, and four-stream 3D CNNs. In the single-stream model, all four channels of each frame were applied to a single stream. In the two-stream architecture, optical flow-x and optical flow-y were applied to one stream and gradient-x and gradient-y to another. In the four-stream architecture, each information channel was applied to its own separate stream. The architectures were evaluated in an action recognition system on the IXMAS data set, which was recorded simultaneously by five cameras. The four-stream architecture outperformed the others, achieving recognition rates of 87.5, 91.66, 91.11, 88.05, and 81.94% for cameras 0-4, respectively (88.05% on average).
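To make the channel construction and multi-stream design concrete, the sketch below builds the four per-frame channels (gradient-x, gradient-y, optical flow-x, optical flow-y) and a minimal four-stream 3D CNN with late fusion. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Sobel and Farneback operators, layer widths, kernel sizes, and classifier head are placeholders chosen here for demonstration.

```python
# Minimal sketch (assumed details, not the paper's exact configuration):
# build four per-frame information channels and feed each channel to its own
# stream of 3D convolutions, fusing the streams for classification.
import cv2
import numpy as np
import torch
import torch.nn as nn


def four_channels(prev_gray, gray):
    """Four information channels for a pair of consecutive 8-bit grayscale frames.
    Sobel gradients and Farneback optical flow are illustrative choices; the
    abstract does not specify the operators used."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)                      # gradient-x
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)                      # gradient-y
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.stack([gx, gy, flow[..., 0], flow[..., 1]])       # (4, H, W)


class Stream3D(nn.Module):
    """One stream: a small stack of 3D convolutions over a single-channel volume."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),        # global spatio-temporal pooling
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 32)


class FourStream3DCNN(nn.Module):
    """Four separate streams, one per information channel, fused by concatenation."""
    def __init__(self, num_classes):        # set num_classes to the data set's action count
        super().__init__()
        self.streams = nn.ModuleList([Stream3D() for _ in range(4)])
        self.classifier = nn.Linear(4 * 32, num_classes)

    def forward(self, x):
        # x: (batch, 4, frames, height, width) -> one single-channel volume per stream
        parts = [stream(x[:, i:i + 1]) for i, stream in enumerate(self.streams)]
        return self.classifier(torch.cat(parts, dim=1))


if __name__ == "__main__":
    clip = torch.randn(2, 4, 16, 64, 64)            # dummy batch of 4-channel clips
    logits = FourStream3DCNN(num_classes=12)(clip)  # 12 classes is a placeholder
    print(logits.shape)                             # torch.Size([2, 12])
```

The single-stream and two-stream variants described in the abstract follow the same pattern by regrouping channels before the streams: all four channels into one stream, or the two optical-flow channels in one stream and the two gradient channels in the other.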
Pages: 338-344
Page count: 7