Local motion feature extraction and spatiotemporal attention mechanism for action recognition

Times cited: 2
Authors
Song, Xiaogang [1]
Zhang, Dongdong [1]
Liang, Li [1]
He, Min [2]
Hei, Xinhong [1]
Affiliations
[1] Xian Univ Technol, Sch Comp Sci & Engn, Xian, Peoples R China
[2] Xian Univ Technol, Sch Civil Engn & Architecture, Xian, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Action recognition; Spatiotemporal attention; Convolution neural network; Abnormal behavior;
DOI
10.1007/s00371-023-03205-1
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Video action recognition faces the important and challenging problem of spatiotemporal relationship modeling. To address it, current methods typically rely on 2D or 3D CNN operations to model local spatiotemporal dependencies at fixed scales. However, most of these models fail to emphasize the keyframes and action-sensitive regions of the input video, which degrades recognition performance. In this paper, an action recognition network with local motion feature extraction and a spatiotemporal attention mechanism is proposed. The proposed network consists of a motion capture (MC) module, which captures detailed motion features, and temporal attention (TA) and spatiotemporal attention (STA) modules, which learn the contribution of each frame and each region to the action at the feature level, respectively. To evaluate our network, we construct a concrete water addition violation dataset (CWAVD), which can be used to identify water addition violations by construction site workers and improve construction management efficiency and quality. The proposed network achieves state-of-the-art performance on three of the most challenging datasets: UCF101 (97.6%), HMDB51 (77.3%) and SSV2 (67.8%).
Pages: 7747-7759
Page count: 13