Spatio-Temporal Attention Networks for Action Recognition and Detection

Cited by: 117
Authors
Li, Jun [1]
Liu, Xianglong [1,2]
Zhang, Wenxuan [1]
Zhang, Mingyuan [1]
Song, Jingkuan [3]
Sebe, Nicu [4]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing 10000, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Big Data Based Precis Med, Beijing 10000, Peoples R China
[3] Univ Elect Sci & Technol China, Innovat Ctr, Chengdu 610051, Peoples R China
[4] Univ Trento, Dept Informat Engn & Comp Sci, I-38122 Trento, Italy
Funding
National Natural Science Foundation of China
Keywords
Three-dimensional displays; Feature extraction; Task analysis; Two dimensional displays; Computer architecture; Optical imaging; Visualization; 3D CNN; spatio-temporal attention; temporal attention; spatial attention; action recognition; action detection; REPRESENTATION; VIDEOS;
DOI
10.1109/TMM.2020.2965434
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Recently, 3D Convolutional Neural Network (3D CNN) models have been widely studied for video sequences and have achieved satisfying performance on action recognition and detection tasks. However, most existing 3D CNNs treat all input video frames equally, ignoring the spatial and temporal differences across frames. To address this problem, we propose a spatio-temporal attention (STA) network that learns a discriminative feature representation for actions by characterizing the beneficial information at both the frame level and the channel level. By simultaneously exploiting the differences in the spatial and temporal dimensions, our STA module enhances the learning capability of the 3D convolutions when handling complex videos. The proposed STA method can be wrapped as a generic module and easily plugged into state-of-the-art 3D CNN architectures for video action detection and recognition. We extensively evaluate our method on action recognition and detection tasks over three popular datasets (UCF-101, HMDB-51, and THUMOS 2014). The experimental results demonstrate that adding our STA module achieves state-of-the-art performance on UCF-101 and HMDB-51, with top-1 accuracies of 98.4% and 81.4%, respectively, and yields significant improvements on THUMOS 2014 compared with the original models.
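The abstract describes STA as attention applied at the frame level (temporal) and the channel level over 3D CNN feature volumes; the paper's exact formulation is not reproduced in this record, so the following is only an illustrative NumPy sketch of that general idea. The projection vectors `w_t` and `w_c` are hypothetical stand-ins for learned parameters, and the residual-style `1 + attention` scaling is an assumption, not the paper's definition.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sta_module(feat, w_t, w_c):
    """Illustrative spatio-temporal attention over a 3D CNN feature map.

    feat: (C, T, H, W) feature volume from a 3D convolution block.
    w_t:  (C,) projection scoring each frame (temporal attention).
    w_c:  (T,) projection scoring each channel (channel attention).
    Both projections are hypothetical stand-ins for learned weights.
    """
    # Pool each (channel, frame) cell over space to get descriptors.
    desc = feat.mean(axis=(2, 3))                  # (C, T)
    # Frame-level (temporal) attention: one weight per frame.
    t_att = softmax(w_t @ desc, axis=0)            # (T,), sums to 1
    # Channel-level attention: one weight per channel.
    c_att = softmax(desc @ w_c, axis=0)            # (C,), sums to 1
    # Re-weight the feature volume by both attention maps; the
    # residual-style scaling keeps the original signal intact, as
    # plug-in attention modules commonly do.
    out = feat * (1 + t_att[None, :, None, None]) \
               * (1 + c_att[:, None, None, None])
    return out, t_att, c_att

# Usage on a toy feature volume (C=4 channels, T=8 frames, 6x6 spatial):
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 6, 6))
out, t_att, c_att = sta_module(feat, rng.standard_normal(4),
                               rng.standard_normal(8))
```

Because the module only re-weights (and never reshapes) its input, it can be inserted after any 3D convolution block without changing the downstream architecture, which matches the "generic plug-in module" claim in the abstract.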
Pages: 2990-3001 (12 pages)
Related Papers
50 records in total
  • [31] Spatio-temporal information for human action recognition
    Yao, Li
    Liu, Yunjian
    Huang, Shihui
    EURASIP JOURNAL ON IMAGE AND VIDEO PROCESSING, 2016
  • [32] Cascading spatio-temporal attention network for real-time action detection
    Yang, Jianhua
    Wang, Ke
    Li, Ruifeng
    Perner, Petra
    MACHINE VISION AND APPLICATIONS, 2023, 34 (06)
  • [34] Spatio-temporal Attention Model for Tactile Texture Recognition
    Cao, Guanqun
    Zhou, Yi
    Bollegala, Danushka
    Luo, Shan
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 9896 - 9902
  • [35] Semantic-guided spatio-temporal attention for few-shot action recognition
    Wang, Jianyu
    Liu, Baolin
    APPLIED INTELLIGENCE, 2024, 54 (03) : 2458 - 2471
  • [36] STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition
    Ahn, Dasom
    Kim, Sangwon
    Ko, Byoung Chul
    APPLIED INTELLIGENCE, 2023, 53 : 28446 - 28459
  • [37] An End to End Framework With Adaptive Spatio-Temporal Attention Module for Human Action Recognition
    Liu, Shaocan
    Ma, Xin
    Wu, Hanbo
    Li, Yibin
    IEEE ACCESS, 2020, 8 : 47220 - 47231
  • [38] Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition
    Cheng, Shilei
    Xie, Mei
    Ma, Zheng
    Li, Siqi
    Gu, Song
    Yang, Feng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (01) : 220 - 224
  • [40] Spatio-temporal attention on manifold space for 3D human action recognition
    Ding, Chongyang
    Liu, Kai
    Cheng, Fei
    Belyaev, Evgeny
    APPLIED INTELLIGENCE, 2021, 51 (01) : 560 - 570