Local motion feature extraction and spatiotemporal attention mechanism for action recognition

Cited by: 0
Authors
Song, Xiaogang [1]
Zhang, Dongdong [1]
Liang, Li [1]
He, Min [2]
Hei, Xinhong [1]
Affiliations
[1] Xian Univ Technol, Sch Comp Sci & Engn, Xian, Peoples R China
[2] Xian Univ Technol, Sch Civil Engn & Architecture, Xian, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Action recognition; Spatiotemporal attention; Convolution neural network; Abnormal behavior;
DOI
10.1007/s00371-023-03205-1
Chinese Library Classification number
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
Video action recognition faces the important and challenging problem of spatiotemporal relationship modeling. To address it, current methods typically rely on 2D or 3D CNN operations to model local spatiotemporal dependencies at fixed scales. However, most of these models fail to emphasize the keyframes and action-sensitive regions of the input video, resulting in poor performance. In this paper, an action recognition network with local motion feature extraction and a spatiotemporal attention mechanism is proposed. The proposed network consists of a motion capture (MC) module, which captures detailed motion features, and temporal attention (TA) and spatiotemporal attention (STA) modules, which learn the contribution of each frame and each region to the action at the feature level, respectively. To evaluate our network, we construct a concrete water addition violation dataset (CWAVD), which can be used to identify water addition violations by construction site workers and improve construction management efficiency and quality. The proposed network achieves state-of-the-art performance on three of the most challenging datasets: UCF101 (97.6%), HMDB51 (77.3%) and SSV2 (67.8%).
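The idea of learning the contribution of each frame at the feature level can be illustrated with a generic softmax-weighted temporal pooling. This is a minimal sketch under stated assumptions, not the authors' TA module: the abstract does not specify how frame scores are computed, so the linear scoring vector `score_weights`, the function name `temporal_attention`, and the toy feature values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, score_weights):
    """Weight T per-frame feature vectors (each of length C) and pool them.

    score_weights is a hypothetical learned vector of length C that maps
    each frame's features to a scalar relevance score (an assumption made
    for illustration; the paper's scoring function is not given here).
    """
    # One scalar score per frame: dot product of features and scoring vector.
    scores = [
        sum(w * x for w, x in zip(score_weights, features))
        for features in frame_features
    ]
    # Normalise scores across the temporal axis so the weights sum to 1.
    weights = softmax(scores)
    # Clip-level feature: attention-weighted sum of the frame features.
    channels = len(frame_features[0])
    pooled = [
        sum(weights[t] * frame_features[t][c] for t in range(len(frame_features)))
        for c in range(channels)
    ]
    return weights, pooled


# Toy input: three frames with two feature channels each (made-up values).
frames = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1]]
weights, clip_feature = temporal_attention(frames, score_weights=[1.0, 1.0])
```

In this toy input the second frame has the largest score, so it receives the largest weight and dominates the pooled clip feature. A spatial counterpart would apply the same normalisation over regions of a feature map rather than over frames.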
Pages: 7747-7759
Page count: 13
Related papers
44 items in total
  • [1] Two-stream spatiotemporal feature fusion for human action recognition
    Abdelbaky, Amany
    Aly, Saleh
    [J]. VISUAL COMPUTER, 2021, 37 (07) : 1821 - 1835
  • [2] The recognition of human movement using temporal templates
    Bobick, AF
    Davis, JW
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2001, 23 (03) : 257 - 267
  • [3] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4724 - 4733
  • [4] MARS: Motion-Augmented RGB Stream for Action Recognition
    Crasto, Nieves
    Weinzaepfel, Philippe
    Alahari, Karteek
    Schmid, Cordelia
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 7874 - 7883
  • [5] TF-Blender: Temporal Feature Blender for Video Object Detection
    Cui, Yiming
    Yan, Liqi
    Cao, Zhiwen
    Liu, Dongfang
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 8118 - 8127
  • [6] Deformable Convolutional Networks
    Dai, Jifeng
    Qi, Haozhi
    Xiong, Yuwen
    Li, Yi
    Zhang, Guodong
    Hu, Han
    Wei, Yichen
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 764 - 773
  • [7] Spatio-temporal Channel Correlation Networks for Action Classification
    Diba, Ali
    Fayyaz, Mohsen
    Sharma, Vivek
    Arzani, M. Mahdi
    Yousefzadeh, Rahman
    Gall, Juergen
    Van Gool, Luc
    [J]. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 299 - 315
  • [8] Identifying the key frames: An attention-aware sampling method for action recognition
    Dong, Wenkai
    Zhang, Zhaoxiang
    Song, Chunfeng
    Tan, Tieniu
    [J]. PATTERN RECOGNITION, 2022, 130
  • [9] FlowNet: Learning Optical Flow with Convolutional Networks
    Dosovitskiy, Alexey
    Fischer, Philipp
    Ilg, Eddy
    Haeusser, Philip
    Hazirbas, Caner
    Golkov, Vladimir
    van der Smagt, Patrick
    Cremers, Daniel
    Brox, Thomas
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 2758 - 2766
  • [10] Learning Spatiotemporal Features with 3D Convolutional Networks
    Du Tran
    Bourdev, Lubomir
    Fergus, Rob
    Torresani, Lorenzo
    Paluri, Manohar
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4489 - 4497