SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Cited by: 5
Authors
Wang, Fei [1 ]
Wang, Guorui [2 ]
Huang, Yunwen [2 ]
Chu, Hao [1 ]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
[2] Northeastern Univ, Coll Informat Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Keywords
Action recognition; action-aware spatial-temporal features; deformable convolution; temporal attention model;
DOI
10.1109/ACCESS.2019.2953113
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of an action, whose geometric structure varies over time in videos; (2) how to extract semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt final performance; (3) the recognition speed of most existing models is too slow for real-world deployment. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST, comprising three modules, which effectively learns semantic action-aware spatial-temporal features at a faster speed. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network (2DDC) whose deformable convolutions adaptively adjust their receptive fields to the complex geometric structure of actions. Then, we propose a lightweight Temporal Attention model (TA) to extract action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, we use only raw RGB frames, rather than RGB plus optical flow, as input to our model. Experimental results on four challenging video recognition datasets (Kinetics-400, Something-Something-V1, UCF101, and HMDB51) demonstrate that our method not only achieves comparable performance but is also 10x to 50x faster than most state-of-the-art action recognition methods.
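The core idea behind the TA module described above, softmax-weighting frames over time so that frames relevant to the labeled action dominate the video-level feature, can be sketched as follows. This is a minimal pure-Python illustration only: the scoring vector `w` and the function name `temporal_attention` are stand-ins for the module's learned parameters and internals, which the abstract does not detail.

```python
import math
import random

def temporal_attention(frame_feats, w):
    """Aggregate T per-frame feature vectors into one video-level vector.

    frame_feats: list of T feature vectors, each of length D
    w: scoring vector of length D (stand-in for the learned projection)
    """
    # Score each frame, then softmax over the temporal axis.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frame_feats]
    m = max(scores)                               # numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]                # attention weights, sum to 1

    # Attention-weighted sum of frame features -> video-level feature.
    video_feat = [sum(a * f[d] for a, f in zip(alphas, frame_feats))
                  for d in range(len(w))]
    return alphas, video_feat

random.seed(0)
T, D = 8, 16
feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w = [random.gauss(0, 1) for _ in range(D)]
alphas, video_feat = temporal_attention(feats, w)
```

Frames with low attention weight (e.g. irrelevant background frames) contribute little to `video_feat`, which is the effect the abstract attributes to TA.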
Pages: 164876-164886 (11 pages)
Related papers
50 records in total
  • [41] Bi-direction hierarchical LSTM with spatial-temporal attention for action recognition
    Yang, Haodong
    Zhang, Jun
    Li, Shuohao
    Luo, Tingjin
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2019, 36 (01) : 775 - 786
  • [42] R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition
    Liu, Quanle
    Che, Xiangjiu
    Bie, Mei
    IEEE ACCESS, 2019, 7 : 82246 - 82255
  • [43] STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition
    Zhang, Yuhan
    Wu, Bo
    Li, Wen
    Duan, Lixin
    Gan, Chuang
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3229 - 3237
  • [44] Human Action Recognition by Fusion of Convolutional Neural Networks and spatial-temporal Information
    Li, Weisheng
    Ding, Yahui
    8TH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE (ICIMCS2016), 2016, : 255 - 259
  • [45] Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition
    Gao, Zhimin
    Wang, Peitao
    Lv, Pei
    Jiang, Xiaoheng
    Liu, Qidong
    Wang, Pichao
    Xu, Mingliang
    Li, Wanqing
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 155 - 171
  • [46] A Spatial-Temporal Feature Fusion Strategy for Skeleton-Based Action Recognition
    Chen, Yitian
    Xu, Yuchen
    Xie, Qianglai
    Xiong, Lei
    Yao, Leiyue
    2023 INTERNATIONAL CONFERENCE ON DATA SECURITY AND PRIVACY PROTECTION, DSPP, 2023, : 207 - 215
  • [47] Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition
    Xu, Haotian
    Jin, Xiaobo
    Wang, Qiufeng
    Hussain, Amir
    Huang, Kaizhu
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (02)
  • [48] EFFICIENT TEMPORAL-SPATIAL FEATURE GROUPING FOR VIDEO ACTION RECOGNITION
    Qiu, Zhikang
    Zhao, Xu
    Hu, Zhilan
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2176 - 2180
  • [49] Spatial-Temporal Adaptive Metric Learning Network for One-Shot Skeleton-Based Action Recognition
    Li, Xuanfeng
    Lu, Jian
    Chen, Xiaogai
    Zhang, Xiaodan
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 321 - 325
  • [50] Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition
    Ren, Ziliang
    Zhang, Qieshi
    Cheng, Jun
    Hao, Fusheng
    Gao, Xiangyang
    NEUROCOMPUTING, 2021, 433 : 142 - 153