Hybrid handcrafted and learned feature framework for human action recognition

Cited by: 12
Authors
Zhang, Chaolong [1 ,2 ]
Xu, Yuanping [2 ]
Xu, Zhijie [1 ]
Huang, Jian [2 ]
Lu, Jun [2 ]
Affiliations
[1] Univ Huddersfield, Sch Comp & Engn, Huddersfield HD1 3DH, W Yorkshire, England
[2] Chengdu Univ Informat Technol, Sch Software Engn, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Dense trajectories; Bag-of-temporal features; Visual stream; Motion stream; VISUAL-WORDS; BAG;
DOI
10.1007/s10489-021-03068-w
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recognising human actions in video is a challenging real-world task. Dense trajectories (DT) offer an accurate record of motion over time that is rich in dynamic information. However, DT models lack a mechanism for distinguishing dominant motions from secondary ones across separable frequency bands and directions. Deep learning-based methods, by contrast, show promise on this challenge but still suffer from a limited capacity to handle complex temporal information, not to mention the huge datasets needed to guide training. To take advantage of semantically meaningful, "handcrafted" video features obtained through feature engineering, this study integrates the discrete wavelet transform (DWT) into the DT model to derive more descriptive human action features. By exploiting pre-trained dual-stream CNN-RNN models, learned features can be integrated with the handcrafted ones to satisfy stringent analytical requirements in the spatial-temporal domain. This hybrid feature framework generates efficient Fisher Vectors through a novel Bag of Temporal Features scheme and can encode video events whilst speeding up action recognition for real-world applications. Evaluation of the design has shown superior recognition performance over existing benchmark systems, and has demonstrated promising applicability and extensibility for challenging real-world human action recognition problems.
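As an illustration of the frequency-band separation the abstract attributes to the DWT step, the sketch below applies a one-level Haar wavelet transform to a hypothetical trajectory displacement signal. The low-frequency band captures the dominant motion trend while the high-frequency band isolates secondary, fine-grained movements. The function and the sample data are illustrative assumptions, not the paper's actual implementation:

```python
import math

def haar_dwt(signal):
    """One-level Haar discrete wavelet transform of a 1-D motion signal.

    Returns (approximation, detail): the approximation coefficients hold
    the low-frequency (dominant) motion component, the detail
    coefficients hold the high-frequency (secondary) component.
    """
    if len(signal) % 2:                       # pad odd-length signals by
        signal = list(signal) + [signal[-1]]  # repeating the last sample
    approx, detail = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / math.sqrt(2))  # pairwise average -> low band
        detail.append((a - b) / math.sqrt(2))  # pairwise difference -> high band
    return approx, detail

# Hypothetical horizontal displacements along one dense trajectory:
# a steady drift followed by small jitter.
dx = [0.9, 1.1, 1.0, 1.2, 0.1, -0.1, 0.0, 0.2]
low, high = haar_dwt(dx)
```

Multi-level decomposition (reapplying the transform to the approximation band) would yield the progressively coarser frequency bands over which dominant and secondary motions can be separated.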
Pages: 12771-12787
Page count: 17