STM: SpatioTemporal and Motion Encoding for Action Recognition

Cited by: 338

Authors:

Jiang, Boyuan [1,3]

Wang, MengMeng [2]

Gan, Weihao [2]

Wu, Wei [2]

Yan, Junjie [2]

Affiliations:

[1] Zhejiang Univ, Hangzhou, Peoples R China

[2] SenseTime Grp Ltd, Hong Kong, Peoples R China

[3] SenseTime, Hong Kong, Peoples R China

Source:

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019

Keywords:

DOI:

10.1109/ICCV.2019.00209

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together.

Citation

Pages: 2000-2009

Page count: 10
