SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Cited by: 5
|
Authors
Wang, Fei [1 ]
Wang, Guorui [2 ]
Huang, Yunwen [2 ]
Chu, Hao [1 ]
Affiliations
[1] Northeastern Univ, Fac Robot Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
[2] Northeastern Univ, Coll Informat Sci & Engn, Shenyang 110004, Liaoning, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Keywords
Action recognition; action-aware spatial-temporal features; deformable convolution; temporal attention model
DOI
10.1109/ACCESS.2019.2953113
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of actions, whose geometric structure varies over time in videos; (2) how to extract semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt final performance; and (3) the recognition speed of most existing models, which is too slow for real-world applications. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST, comprising three modules, which effectively learns semantic action-aware spatial-temporal features at a faster speed. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network named 2DDC, whose deformable convolutions have receptive fields that adapt to the complex geometric structure of actions. Then, we propose a light Temporal Attention model called TA to extract the action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, we use only raw RGB frames, rather than RGB plus optical flow, as the input to our model. Experimental results on four challenging video recognition datasets (Kinetics-400, Something-Something-V1, UCF101, and HMDB51) demonstrate that our method not only achieves comparable performance but is also 10x to 50x faster than most state-of-the-art action recognition methods.
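The record does not specify how the TA module is implemented beyond it being a light temporal attention model that down-weights frames irrelevant to the labeled action. As an illustration only, a minimal temporal-attention sketch over per-frame features (function name, shapes, and the fixed projection vector are all assumptions, not the authors' implementation) could look like:

```python
import numpy as np

def temporal_attention(frame_feats):
    """Illustrative temporal attention over frames (not the paper's TA module).

    frame_feats: (T, D) array of per-frame features.
    Returns the attention-pooled video-level feature (D,) and the
    per-frame weights (T,), which sum to 1.
    """
    T, D = frame_feats.shape
    # Hypothetical learnable projection; fixed here so the example runs.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(D) / np.sqrt(D)
    scores = frame_feats @ w                  # one relevance score per frame
    scores -= scores.max()                    # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    video_feat = weights @ frame_feats        # weighted temporal pooling
    return video_feat, weights

# Usage: 8 frames with 16-dim features, e.g. outputs of a 2D backbone.
feats = np.random.default_rng(1).standard_normal((8, 16))
video_feat, weights = temporal_attention(feats)
```

The key property this sketch shares with any temporal-attention pooling is that the video-level feature is a convex combination of frame features, so frames with low relevance scores contribute little to the final representation.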
Pages: 164876-164886
Page count: 11
Related Papers (50 records)
  • [21] Spatial-Temporal Context-Aware Online Action Detection and Prediction
    Huang, Jingjia
    Li, Nannan
    Li, Thomas
    Liu, Shan
    Li, Ge
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2650 - 2662
  • [22] Action recognition by learning temporal slowness invariant features
    Pei, Lishen
    Ye, Mao
    Zhao, Xuezhuan
    Dou, Yumin
    Bao, Jiao
    VISUAL COMPUTER, 2016, 32 (11) : 1395 - 1404
  • [23] Human action recognition via multi-task learning base on spatial-temporal feature
    Guo, Wenzhong
    Chen, Guolong
    INFORMATION SCIENCES, 2015, 320 : 418 - 428
  • [25] Rotation-based spatial-temporal feature learning from skeleton sequences for action recognition
    Liu, Xing
    Li, Yanshan
    Xia, Rongjie
    SIGNAL IMAGE AND VIDEO PROCESSING, 2020, 14 (06) : 1227 - 1234
  • [26] A Novel Action Recognition Scheme Based on Spatial-Temporal Pyramid Model
    Zhao, Hengying
    Xiang, Xinguang
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT II, 2018, 10736 : 212 - 221
  • [27] Spatial-temporal channel-wise attention network for action recognition
    Chen, Lin
    Liu, Yungang
    Man, Yongchao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (14) : 21789 - 21808
  • [28] Spatial-Temporal Transformer Network for Continuous Action Recognition in Industrial Assembly
    Huang, Jianfeng
    Liu, Xiang
    Hu, Huan
    Tang, Shanghua
    Li, Chenyang
    Zhao, Shaoan
    Lin, Yimin
    Wang, Kai
    Liu, Zhaoxiang
    Lian, Shiguo
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 114 - 130
  • [29] Recurrent attention network using spatial-temporal relations for action recognition
    Zhang, Mingxing
    Yang, Yang
    Ji, Yanli
    Xie, Ning
    Shen, Fumin
    SIGNAL PROCESSING, 2018, 145 : 137 - 145
  • [30] Spatial-temporal pyramid based Convolutional Neural Network for action recognition
    Zheng, Zhenxing
    An, Gaoyun
    Wu, Dapeng
    Ruan, Qiuqi
    NEUROCOMPUTING, 2019, 358 : 446 - 455