SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Times Cited: 5
Authors
Wang, Fei [1 ]
Wang, Guorui [2 ]
Huang, Yunwen [2 ]
Chu, Hao [1 ]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
[2] Northeastern Univ, Coll Informat Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
Source
IEEE ACCESS | 2019, Vol. 7
Keywords
Action recognition; action-aware spatial-temporal features; deformable convolution; temporal attention model;
DOI
10.1109/ACCESS.2019.2953113
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of an action, since its geometric structure varies over time in a video; (2) how to develop semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt the final performance; and (3) the recognition speed of most existing models is too slow for real-world applications. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST, comprising three modules that effectively learn semantic action-aware spatial-temporal features at a faster speed. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network named 2DDC, whose deformable convolutions adaptively adjust their receptive fields to the complex geometric structure of actions. Then, we propose a light Temporal Attention model called TA to develop action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, our model takes only raw RGB frames as input rather than both optical flow and RGB. Experimental results on four challenging video recognition datasets, Kinetics-400, Something-Something-V1, UCF101, and HMDB51, demonstrate that the proposed method not only achieves comparable performance but is also 10x to 50x faster than most state-of-the-art action recognition methods.
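To make the abstract's three modules concrete, below is a minimal PyTorch sketch, not the authors' released implementation: it assumes torchvision's DeformConv2d for the weight-shared deformable spatial module (cf. 2DDC), a single linear scoring layer for the light temporal attention (cf. TA), and a small Conv3d standing in for the final temporal-context network. All class names, feature sizes, and the pooling scheme are illustrative assumptions.

# Minimal sketch of the three SAST modules described in the abstract.
# Module names and all hyperparameters are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableSpatial(nn.Module):
    """Weight-shared 2D deformable convolution applied to every frame (cf. 2DDC)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A plain conv predicts per-location sampling offsets (2 per kernel tap),
        # so the receptive field can deform with the action's geometry.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):            # x: (B*T, C, H, W)
        return self.deform(x, self.offset(x))

class TemporalAttention(nn.Module):
    """Light attention that up-weights frames relevant to the labeled action (cf. TA)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):            # x: (B, T, C) pooled per-frame features
        w = torch.softmax(self.score(x), dim=1)   # (B, T, 1), sums to 1 over time
        return x * w                              # re-weighted frame features

class SASTSketch(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, num_classes=400):
        super().__init__()
        self.spatial = DeformableSpatial(in_ch, feat_ch)
        self.attn = TemporalAttention(feat_ch)
        # A small 3D conv stands in for the final temporal-context network.
        self.temporal = nn.Conv3d(feat_ch, feat_ch,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.fc = nn.Linear(feat_ch, num_classes)

    def forward(self, clip):         # clip: (B, T, 3, H, W), raw RGB only
        b, t = clip.shape[:2]
        f = self.spatial(clip.flatten(0, 1))            # (B*T, C, H, W)
        pooled = f.mean(dim=(2, 3)).view(b, t, -1)      # (B, T, C)
        pooled = self.attn(pooled)                      # action-aware weighting
        v = pooled.permute(0, 2, 1)[..., None, None]    # (B, C, T, 1, 1)
        v = self.temporal(v).mean(dim=2).flatten(1)     # video-level feature
        return self.fc(v)

logits = SASTSketch()(torch.randn(2, 8, 3, 32, 32))     # (2, 400)

The softmax over the time axis mirrors the paper's stated goal of down-weighting frames irrelevant to the labeled action class, while predicting offsets per location lets the receptive field follow the geometric deformation of the action across frames.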
Pages: 164876-164886
Number of Pages: 11
Related Papers
50 in total
  • [31] Spatial-temporal channel-wise attention network for action recognition
    Chen, Lin
    Liu, Yungang
    Man, Yongchao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 : 21789 - 21808
  • [32] Deep Fusion of Skeleton Spatial-Temporal and Dynamic Information for Action Recognition
    Gao, Song
    Zhang, Dingzhuo
    Tang, Zhaoming
    Wang, Hongyan
    SENSORS, 2024, 24 (23)
  • [33] A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition
    Wang, Huafeng
    Xia, Tao
    Li, Hanlin
    Gu, Xianfeng
    Lv, Weifeng
    Wang, Yuehai
    MATHEMATICS, 2021, 9 (24)
  • [34] Improved SSD using deep multi-scale attention spatial-temporal features for action recognition
    Zhou, Shuren
    Qiu, Jia
    Solanki, Arun
    MULTIMEDIA SYSTEMS, 2022, 28 (06) : 2123 - 2131
  • [35] Extracting hierarchical spatial and temporal features for human action recognition
    Zhang, Keting
    Zhang, Liqing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (13) : 16053 - 16068
  • [37] Learning semantic features for action recognition via diffusion maps
    Liu, Jingen
    Yang, Yang
    Saleemi, Imran
    Shah, Mubarak
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2012, 116 (03) : 361 - 377
  • [38] Action Recognition by Fusing Spatial-Temporal Appearance and The Local Distribution of Interest Points
    Lu, Mengmeng
    Zhang, Liang
    PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON FUTURE COMPUTER AND COMMUNICATION ENGINEERING, 2014, 111 : 75 - 78
  • [39] ST-HViT: spatial-temporal hierarchical vision transformer for action recognition
    Xia, Limin
    Fu, Weiye
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (01)
  • [40] Spatial-temporal graph attention networks for skeleton-based action recognition
    Huang, Qingqing
    Zhou, Fengyu
    He, Jiakai
    Zhao, Yang
    Qin, Runze
    JOURNAL OF ELECTRONIC IMAGING, 2020, 29 (05)