Multi-scale spatialtemporal information deep fusion network with temporal pyramid mechanism for video action recognition

Cited by: 6
Authors
Ou, Hongshi [1]
Sun, Jifeng [1]
Affiliations
[1] South China Univ Technol, Sch Elect & Informat Engn, 381 Wushan Rd, Guangzhou 510641, Peoples R China
Keywords
Video action recognition; spatial-temporal information deep fusion; deep learning
DOI
10.3233/JIFS-189714
Chinese Library Classification number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In deep learning-based video action recognition, the neural network must capture spatial information, motion information, and the associations between the two over uneven time spans. This paper proposes a network that extracts the semantic information of a video sequence through deep fusion of local spatial-temporal information. The network uses a 2D convolutional neural network (2DCNN) and a multi spatial-temporal scale 3D convolutional neural network (MST_3DCNN) to extract spatial information and motion information, respectively. Spatial and motion information from the same time interval are fused by 3D convolution to generate the spatial-temporal information of a single moment. The spatial-temporal information of multiple single moments then enters a Temporal Pyramid Net (TPN), which generates local spatial-temporal information at multiple time scales. Finally, a bidirectional recurrent neural network operates on the spatial-temporal information of all parts to acquire context spanning the entire video, which gives the network the ability to extract video-level context information. In experiments on three common video action recognition data sets, UCF101, UCF11, and UCFSports, the proposed spatial-temporal information deep fusion network achieves a high recognition accuracy.
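The data flow after the per-moment fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layers, so mean pooling stands in for the Temporal Pyramid Net and an exponential running average stands in for the bidirectional recurrent network. The names `temporal_pyramid`, `running_state`, and `bidirectional_context`, the scales `(1, 2, 4)`, and the `decay` parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(window: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length feature vectors."""
    n = len(window)
    return [sum(column) / n for column in zip(*window)]


def temporal_pyramid(
    features: List[Vector], scales: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Vector]]:
    """Pool per-moment features over non-overlapping windows at each scale.

    Assumption: mean pooling replaces the learned TPN layers, which the
    abstract does not describe.
    """
    pyramid = {}
    for scale in scales:
        windows = [features[i:i + scale] for i in range(0, len(features), scale)]
        pyramid[scale] = [mean_pool(w) for w in windows]
    return pyramid


def running_state(seq: List[Vector], decay: float) -> List[Vector]:
    """Exponential running average over a sequence (toy recurrent state)."""
    state = [0.0] * len(seq[0])
    states = []
    for x in seq:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        states.append(state)
    return states


def bidirectional_context(seq: List[Vector], decay: float = 0.5) -> List[Vector]:
    """Concatenate forward and backward running states at each position.

    Assumption: a hypothetical stand-in for a bidirectional recurrent network.
    """
    forward = running_state(seq, decay)
    backward = running_state(seq[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Calling `temporal_pyramid` on a list of fused per-moment feature vectors yields one list of local features per time scale, and `bidirectional_context` then gives every local feature a view of both earlier and later parts of the video, which is the role the abstract assigns to the bidirectional recurrent network.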
Pages: 4533-4545
Page count: 13