SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Cited by: 5
Authors
Wang, Fei [1 ]
Wang, Guorui [2 ]
Huang, Yunwen [2 ]
Chu, Hao [1 ]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
[2] Northeastern Univ, Coll Informat Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Keywords
Action recognition; action-aware spatial-temporal features; deformable convolution; temporal attention model;
DOI
10.1109/ACCESS.2019.2953113
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of an action, whose geometric structure varies over time in videos; (2) how to extract semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt final performance; and (3) the recognition speed of most existing models is too slow for real-world deployment. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST, comprising three key modules, which effectively learns semantic action-aware spatial-temporal features at high speed. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network, named 2DDC, whose deformable convolutions adaptively adjust their receptive fields to the complex geometric structure of actions. Second, we propose a lightweight Temporal Attention model, called TA, to extract action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, the model takes only raw RGB frames as input, rather than both optical flow and RGB. Experimental results on four challenging video recognition datasets (Kinetics-400, Something-Something-V1, UCF101, and HMDB51) demonstrate that the proposed method not only achieves comparable accuracy but also runs 10x to 50x faster than most state-of-the-art action recognition methods.
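The abstract outlines a three-stage pipeline: frame-wise deformable 2D convolutions (2DDC), lightweight temporal attention (TA), and a 3D network for temporal context. The PyTorch sketch below illustrates one plausible wiring of these stages; it is not the authors' code. torchvision's DeformConv2d stands in for the paper's deformable convolutions, and all layer sizes, the single deformable block, and the shallow 3D head are illustrative assumptions.

# Minimal illustrative sketch of the SAST pipeline from the abstract.
# Module names (SAST, TA, 2DDC) follow the paper; exact wiring is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformBlock2D(nn.Module):
    """Weight-shared 2D deformable convolution applied per frame (the '2DDC' idea)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A plain conv predicts per-location sampling offsets (2 per kernel tap).
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):                      # x: (N*T, C, H, W)
        return F.relu(self.deform(x, self.offset(x)))

class TemporalAttention(nn.Module):
    """Lightweight temporal attention ('TA'): score each frame, softmax over time."""
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Linear(ch, 1)

    def forward(self, x):                      # x: (N, T, C, H, W)
        pooled = x.mean(dim=(3, 4))            # (N, T, C) global spatial pooling
        w = torch.softmax(self.score(pooled), dim=1)  # (N, T, 1) frame weights
        return x * w[..., None, None]          # re-weight frames, keep shape

class SAST(nn.Module):
    def __init__(self, num_classes, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 7, stride=2, padding=3)
        self.ddc = DeformBlock2D(ch, ch)       # weights shared across all T frames
        self.ta = TemporalAttention(ch)
        # Stand-in for the paper's 3D network that models temporal context.
        self.net3d = nn.Sequential(
            nn.Conv3d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(ch, num_classes)

    def forward(self, video):                  # video: (N, T, 3, H, W), RGB only
        n, t = video.shape[:2]
        x = video.flatten(0, 1)                # fold time into batch: (N*T, 3, H, W)
        x = self.ddc(F.relu(self.stem(x)))
        x = x.unflatten(0, (n, t))             # (N, T, C, H, W)
        x = self.ta(x).transpose(1, 2)         # (N, C, T, H, W) for Conv3d
        feat = self.net3d(x).flatten(1)        # (N, C) video-level representation
        return self.fc(feat)

logits = SAST(num_classes=400)(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 400])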
Pages: 164876-164886
Page count: 11
Related Papers
50 records in total
  • [1] Learning Semantic-Aware Spatial-Temporal Attention for Interpretable Action Recognition
    Fu, Jie
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08) : 5213 - 5224
  • [2] Grouped Spatial-Temporal Aggregation for Efficient Action Recognition
    Luo, Chenxu
    Yuille, Alan
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5511 - 5520
  • [3] Spatial-Temporal Interleaved Network for Efficient Action Recognition
    Jiang, Shengqin
    Zhang, Haokui
    Qi, Yuankai
    Liu, Qingshan
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2025, 21 (01) : 178 - 187
  • [4] STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition
    Nguyen, Tam V.
    Song, Zheng
    Yan, Shuicheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2015, 25 (01) : 77 - 86
  • [5] Spatial-Temporal Attention for Action Recognition
    Sun, Dengdi
    Wu, Hanqing
    Ding, Zhuanlian
    Luo, Bin
    Tang, Jin
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 854 - 864
  • [6] Learning spatial-temporal features via a pose-flow relational model for action recognition
    Wu, Qianyu
    Hu, Fangqiang
    Zhu, Aichun
    Wang, Zixuan
    Bao, Yaping
    AIP ADVANCES, 2020, 10 (07)
  • [7] Joint spatial-temporal attention for action recognition
    Yu, Tingzhao
    Guo, Chaoxu
    Wang, Lingfeng
    Gu, Huxiang
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION LETTERS, 2018, 112 : 226 - 233
  • [8] Spatial-Temporal Neural Networks for Action Recognition
    Jing, Chao
    Wei, Ping
    Sun, Hongbin
    Zheng, Nanning
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2018, 2018, 519 : 619 - 627
  • [9] Spatial-temporal pooling for action recognition in videos
    Wang, Jiaming
    Shao, Zhenfeng
    Huang, Xiao
    Lu, Tao
    Zhang, Ruiqian
    Lv, Xianwei
    NEUROCOMPUTING, 2021, 451 : 265 - 278
  • [10] Spatial-temporal interaction module for action recognition
    Luo, Hui-Lan
    Chen, Han
    Cheung, Yiu-Ming
    Yu, Yawei
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (04)