Two-stream spatiotemporal feature fusion for human action recognition

Cited by: 43
Authors
Abdelbaky, Amany [1 ]
Aly, Saleh [1 ,2 ]
Affiliations
[1] Aswan Univ, Fac Engn, Dept Elect Engn, Aswan 81542, Egypt
[2] Majmaah Univ, Dept Informat Technol, Coll Comp & Informat Sci, Majmaah 11952, Saudi Arabia
Keywords
Human action recognition; Spatiotemporal; Convolutional neural networks; Principal component analysis network; BoF; VLAD;
DOI
10.1007/s00371-020-01940-3
CLC classification
TP31 [Computer Software];
Discipline codes
081202; 0835;
Abstract
Human action recognition remains a challenging topic in computer vision that has attracted a large number of researchers. It is of significant importance in a variety of applications such as intelligent video surveillance, sports analysis, and human-computer interaction. Recent works attempt to exploit progress in deep learning architectures to learn spatial and temporal features from action videos. However, it remains unclear how best to combine spatial and temporal information within a convolutional neural network. In this paper, we propose a novel human action recognition method that fuses spatial and temporal features learned by a simple unsupervised convolutional neural network, the principal component analysis network (PCANet), in combination with bag-of-features (BoF) and vector of locally aggregated descriptors (VLAD) encoding schemes. First, both spatial and temporal features are learned via PCANet using a subset of frames and temporal templates for each video, and their dimensionality is reduced using a whitening transformation (WT). The temporal templates are computed as short-time motion energy images (ST-MEI) based on frame differencing. Then, the encoding scheme is applied to represent the final dual spatiotemporal PCANet features by feature fusion. Finally, a support vector machine (SVM) classifier is employed for action recognition. Extensive experiments have been performed on two popular datasets, KTH and UCF Sports, to evaluate the performance of the proposed method. Experimental results using a leave-one-out evaluation strategy demonstrate that the proposed method presents satisfactory and comparable results on both datasets.
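As a rough illustration of the temporal-template step described in the abstract, the sketch below accumulates thresholded frame differences into a short-time motion energy image. The function name, window handling, and threshold value are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def short_time_mei(frames, threshold=0.05):
    """Build a binary motion energy template from a short window of
    grayscale frames via frame differencing (illustrative sketch only;
    parameter values are assumptions, not the paper's settings)."""
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))        # frame differencing
    motion = diffs > threshold                     # binarize motion regions
    return motion.any(axis=0).astype(np.float64)   # union over the window

# Toy example: a 2x2 bright square moving one pixel per frame.
frames = np.zeros((4, 8, 8))
for t in range(4):
    frames[t, 2:4, t:t + 2] = 1.0
mei = short_time_mei(frames)
```

In a full pipeline of the kind the abstract outlines, templates like `mei` would be fed to PCANet as the temporal stream, while raw frames supply the spatial stream.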
Pages: 1821-1835
Number of pages: 15