SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Cited by: 5
Authors
Wang, Fei [1]
Wang, Guorui [2]
Huang, Yunwen [2]
Chu, Hao [1]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
[2] Northeastern Univ, Coll Informat Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Keywords
Action recognition; action-aware spatial-temporal features; deformable convolution; temporal attention model;
DOI
10.1109/ACCESS.2019.2953113
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of actions, since actions undergo geometric variation over time in videos; (2) how to learn semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class, which hurts final performance; and (3) the recognition speed of most existing models is too slow for use in real-world scenes. To address these three challenges, we propose a novel CNN-based action recognition method called SAST, which comprises three modules and efficiently learns semantic action-aware spatial-temporal features. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D deformable convolutional network, named 2DDC, whose deformable convolutions adapt their receptive fields to the complex geometric structure of actions. Second, we propose a lightweight temporal attention model, called TA, to learn action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, we use only raw RGB frames as input, rather than both optical flow and RGB. Experimental results on four challenging video recognition datasets (Kinetics-400, Something-Something-V1, UCF101, and HMDB51) demonstrate that our method not only achieves comparable performance but is also 10x to 50x faster than most state-of-the-art action recognition methods.
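The abstract does not specify how the TA module is built. As a rough illustration of the general idea of temporal attention (down-weighting frames that are irrelevant to the action), the sketch below scores each frame's feature vector with a linear scoring vector, normalizes the scores with a softmax over time, and rescales each frame by its weight. The function names, the linear scoring scheme, and the toy inputs are all assumptions made for illustration; this is not the paper's implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(
    frame_features: Sequence[Sequence[float]],
    scoring_vector: Sequence[float],
) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical temporal attention over T frames of D-dim features.

    frame_features: T x D per-frame features (e.g. from a 2D backbone).
    scoring_vector: D weights; a stand-in for learned attention parameters.
    Returns the per-frame attention weights and the reweighted features.
    """
    # One scalar relevance score per frame (dot product with the scoring vector).
    scores = [
        sum(f * w for f, w in zip(frame, scoring_vector))
        for frame in frame_features
    ]
    # Normalize across the time axis so the weights sum to 1.
    weights = softmax(scores)
    # Scale each frame's features by its attention weight.
    reweighted = [
        [w * f for f in frame] for w, frame in zip(weights, frame_features)
    ]
    return weights, reweighted


if __name__ == "__main__":
    # Toy input: 3 frames with 2-dim features (made-up values).
    frames = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    weights, reweighted = temporal_attention(frames, [1.0, 0.0])
    print(weights)
    print(reweighted)
```

In a pipeline shaped like the one the abstract describes, the reweighted frame features would then be passed to a 3D network to model temporal context. Rescaling frames rather than pooling them into one vector is itself an assumption here, chosen so that the time axis is preserved for that later stage.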
Pages: 164876 - 164886
Page count: 11
Related papers
50 in total
  • [41] CASCADED TEMPORAL SPATIAL FEATURES FOR VIDEO ACTION RECOGNITION
    Yu, Tingzhao
    Gu, Huxiang
    Wang, Lingfeng
    Xiang, Shiming
    Pan, Chunhong
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 1552 - 1556
  • [42] Spatio-temporal Semantic Features for Human Action Recognition
    Liu, Jia
    Wang, Xiaonian
    Li, Tianyu
    Yang, Jie
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2012, 6 (10): : 2632 - 2649
  • [43] Joints-Centered Spatial-Temporal Features Fused Skeleton Convolution Network for Action Recognition
    Song, Wenfeng
    Chu, Tangli
    Li, Shuai
    Li, Nannan
    Hao, Aimin
    Qin, Hong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4602 - 4616
  • [44] Human Action Recognition by Decision-Making Level Fusion Based on Spatial-Temporal Features
    Li Yandi
    Xu Xiping
    ACTA OPTICA SINICA, 2018, 38 (08)
  • [45] Spatial-Temporal Context-Aware Online Action Detection and Prediction
    Huang, Jingjia
    Li, Nannan
    Li, Thomas
    Liu, Shan
    Li, Ge
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2650 - 2662
  • [46] Action recognition by learning temporal slowness invariant features
    Lishen Pei
    Mao Ye
    Xuezhuan Zhao
    Yumin Dou
    Jiao Bao
    The Visual Computer, 2016, 32 : 1395 - 1404
  • [47] Action recognition by learning temporal slowness invariant features
    Pei, Lishen
    Ye, Mao
    Zhao, Xuezhuan
    Dou, Yumin
    Bao, Jiao
    VISUAL COMPUTER, 2016, 32 (11): : 1395 - 1404
  • [48] Human action recognition via multi-task learning base on spatial-temporal feature
    Guo, Wenzhong
    Chen, Guolong
    INFORMATION SCIENCES, 2015, 320 : 418 - 428
  • [49] Convolutional non-local spatial-temporal learning for multi-modality action recognition
    Ren, Ziliang
    Yuan, Huaqiang
    Wei, Wenhong
    Zhao, Tiezhu
    Zhang, Qieshi
    ELECTRONICS LETTERS, 2022, 58 (20) : 765 - 767
  • [50] Video-based Driver Action Recognition via Spatial-Temporal and Motion Deep Learning
    Ma, Fangzhi
    Xing, Guanyu
    Liu, Yanli
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,