Spatio-Temporal Attention Networks for Action Recognition and Detection

Cited by: 117
Authors
Li, Jun [1 ]
Liu, Xianglong [1 ,2 ]
Zhang, Wenxuan [1 ]
Zhang, Mingyuan [1 ]
Song, Jingkuan [3 ]
Sebe, Nicu [4 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing 10000, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Big Data Based Precis Med, Beijing 10000, Peoples R China
[3] Univ Elect Sci & Technol China, Innovat Ctr, Chengdu 610051, Peoples R China
[4] Univ Trento, Dept Informat Engn & Comp Sci, I-38122 Trento, Italy
Funding
National Natural Science Foundation of China;
Keywords
Three-dimensional displays; Feature extraction; Task analysis; Two dimensional displays; Computer architecture; Optical imaging; Visualization; 3D CNN; spatio-temporal attention; temporal attention; spatial attention; action recognition; action detection; REPRESENTATION; VIDEOS;
DOI
10.1109/TMM.2020.2965434
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Recently, 3D Convolutional Neural Network (3D CNN) models have been widely studied for video sequences and have achieved satisfactory performance in action recognition and detection tasks. However, most existing 3D CNNs treat all input video frames equally, ignoring the spatial and temporal differences across the frames. To address this problem, we propose a spatio-temporal attention (STA) network that learns discriminative feature representations for actions by characterizing the beneficial information at both the frame level and the channel level. By simultaneously exploiting differences in the spatial and temporal dimensions, the STA module enhances the learning capability of 3D convolutions when handling complex videos. The proposed STA method can be wrapped as a generic module and easily plugged into state-of-the-art 3D CNN architectures for video action detection and recognition. We extensively evaluate our method on action recognition and detection tasks over three popular datasets (UCF-101, HMDB-51 and THUMOS 2014). The experimental results demonstrate that adding our STA module yields state-of-the-art performance on UCF-101 and HMDB-51, with top-1 accuracies of 98.4% and 81.4% respectively, and achieves significant improvements on the THUMOS 2014 dataset compared with the original models.
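This record does not include the authors' implementation. The following is a minimal, hypothetical sketch of how a plug-in attention block of this kind (channel-level, frame-level/temporal, and spatial re-weighting of 3D CNN feature maps) could look in PyTorch. The class name STABlock, the reduction ratio, and the pooling-based attention branches are illustrative assumptions, not the architecture described in the paper.

import torch
import torch.nn as nn


class STABlock(nn.Module):
    """Hypothetical plug-in attention block for (N, C, T, H, W) feature maps."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze the spatio-temporal dimensions, then
        # re-weight each channel (squeeze-and-excitation style; an assumption).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Temporal attention: one scalar weight per frame along T.
        self.temporal_att = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)),     # -> (N, C, T, 1, 1)
            nn.Conv3d(channels, 1, kernel_size=1),  # -> (N, 1, T, 1, 1)
            nn.Sigmoid(),
        )
        # Spatial attention: one weight per spatio-temporal location.
        self.spatial_att = nn.Sequential(
            nn.Conv3d(channels, 1, kernel_size=1),  # -> (N, 1, T, H, W)
            nn.Sigmoid(),
        )

    def forward(self, x):                # x: (N, C, T, H, W)
        x = x * self.channel_att(x)      # emphasize informative channels
        x = x * self.temporal_att(x)     # emphasize informative frames
        x = x * self.spatial_att(x)      # emphasize informative regions
        return x


if __name__ == "__main__":
    clip_feats = torch.randn(2, 64, 16, 28, 28)   # features from a 3D CNN stage
    out = STABlock(64)(clip_feats)
    print(out.shape)                              # torch.Size([2, 64, 16, 28, 28])

Because each branch only rescales the input via broadcast multiplication, such a block leaves tensor shapes unchanged and can in principle be inserted after any convolutional stage of an existing 3D CNN, which is consistent with the abstract's description of STA as a generic plug-in module.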
Pages: 2990-3001
Page count: 12