Improved SSD using deep multi-scale attention spatial-temporal features for action recognition

Cited by: 3
Authors
Zhou, Shuren [1]
Qiu, Jia [1]
Solanki, Arun [2]
Affiliations
[1] Changsha Univ Sci & Technol, Sch Comp & Commun Engn, Changsha 410114, Peoples R China
[2] Gautam Buddha Univ, Sch Informat & Commun Technol, Noida, India
Keywords
Action recognition; Multi-scale spatial-temporal feature; Attention mechanism
DOI
10.1007/s00530-021-00831-4
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The main difference between video-based and image-based action recognition is that video adds a time dimension. Most deep-learning methods for action recognition take one of two approaches: (1) using 3D convolution to model temporal features, or (2) introducing an auxiliary temporal feature such as optical flow. However, 3D convolutional networks usually consume substantial computational resources, while optical flow extraction requires an extra, tedious processing step and additional storage, and usually models only short-range temporal features. To model temporal features better, we propose a multi-scale attention spatial-temporal feature network based on SSD. The video is divided into segments spanning its whole length and sampled sparsely, and a self-attention mechanism captures the relation between each frame and the sequence of frames sampled over the entire video, so that the network attends to the representative frames in the sequence. In addition, an attention mechanism assigns different weights to the inter-frame relations that represent different time scales, in order to reason about the contextual relations of actions along the time dimension. The proposed method achieves competitive performance on two commonly used datasets, UCF101 and HMDB51.
Pages: 2123-2131
Page count: 9