Spatial-Temporal Convolutional Attention Network for Action Recognition

Cited by: 0
Authors
Luo, Huilan [1 ]
Chen, Han [1 ]
Affiliations
[1] School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi, China
Keywords
action recognition; convolutional network; deep learning; feature fusion; self-attention mechanism
DOI
10.3778/j.issn.1002-8331.2112-0579
Abstract
In video action recognition, how fully a model learns and exploits the correlations among features, in both the spatial and temporal dimensions of a video, strongly affects final recognition performance. Convolution obtains local features by computing correlations between feature points within a neighborhood, while the self-attention mechanism learns global information through interactions among all feature points. A single convolutional layer cannot learn feature correlations from a global perspective, and even stacking multiple layers only yields somewhat larger receptive fields. Conversely, although a self-attention layer has a global view, it attends only to the content relationships between feature points and ignores local positional characteristics. To address these problems, a spatial-temporal convolutional attention network is proposed for action recognition. It consists of a spatial convolutional attention network and a temporal convolutional attention network. The spatial convolutional attention network uses self-attention to capture appearance feature relationships in the spatial dimension and one-dimensional convolution to extract dynamic information. The temporal convolutional attention network obtains correlations between frame-level features in the temporal dimension through self-attention and uses two-dimensional convolution to learn spatial features. The spatial-temporal convolutional attention network fuses the prediction results of the two networks to improve recognition performance. In experiments on the HMDB51 dataset, introducing the spatial-temporal convolutional attention module into a ResNet50 baseline improves recognition accuracy by 6.25 and 5.13 percentage points on the spatial and temporal streams, respectively. Compared with current state-of-the-art methods, the spatial-temporal convolutional attention network shows clear advantages on the UCF101 and HMDB51 datasets. The proposed network effectively captures feature correlation information; it combines the global connectivity of self-attention with the local connectivity of convolution, improving the network's spatial-temporal modeling ability. © 2023 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
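To make the two-branch design concrete, the following is a minimal PyTorch-style sketch of the idea described in the abstract: a spatial branch applying self-attention over spatial positions plus a 1D convolution over time, a temporal branch applying self-attention over frames plus a 2D convolution over space, and score-level fusion of the two streams. All class names, layer sizes, pooling choices, and tensor shapes here are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class SpatialConvAttention(nn.Module):
    # Spatial branch: self-attention across the spatial positions of
    # each frame, then 1D convolution along time for dynamic information.
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                   # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        tokens = x.reshape(b * t, c, h * w).transpose(1, 2) # (B*T, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)     # global spatial attention
        tokens = tokens + attended                          # residual connection
        feats = tokens.mean(dim=1).reshape(b, t, c)         # pool to frame descriptors
        feats = self.temporal_conv(feats.transpose(1, 2))   # (B, C, T), conv over time
        return feats.mean(dim=2)                            # (B, C) clip descriptor

class TemporalConvAttention(nn.Module):
    # Temporal branch: self-attention across frame-level features,
    # plus 2D convolution for local spatial structure.
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                   # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        frames = x.mean(dim=(3, 4))                         # (B, T, C) frame features
        attended, _ = self.attn(frames, frames, frames)     # attention over time
        frames = frames + attended                          # residual connection
        spatial = self.spatial_conv(x.reshape(b * t, c, h, w))
        spatial = spatial.mean(dim=(2, 3)).reshape(b, t, c) # pooled spatial features
        return (frames + spatial).mean(dim=1)               # (B, C) clip descriptor

class STConvAttentionNet(nn.Module):
    # Fuses the two branches at the score level, echoing the
    # two-stream fusion described in the abstract.
    def __init__(self, channels=64, num_classes=51):
        super().__init__()
        self.spatial_branch = SpatialConvAttention(channels)
        self.temporal_branch = TemporalConvAttention(channels)
        self.fc_s = nn.Linear(channels, num_classes)
        self.fc_t = nn.Linear(channels, num_classes)

    def forward(self, x):
        s = self.fc_s(self.spatial_branch(x)).softmax(dim=-1)
        t = self.fc_t(self.temporal_branch(x)).softmax(dim=-1)
        return (s + t) / 2                                  # averaged stream scores

clip = torch.randn(2, 8, 64, 14, 14)                        # (B, T, C, H, W) dummy clip
print(STConvAttentionNet()(clip).shape)                     # torch.Size([2, 51])

Averaging softmax scores keeps the two branches independent, so each stream can be trained and evaluated separately, consistent with the score-level fusion the abstract describes.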
Pages: 150-158 (8 pages)