STM: SpatioTemporal and Motion Encoding for Action Recognition

Cited by: 338
Authors
Jiang, Boyuan [1, 3]
Wang, MengMeng [2]
Gan, Weihao [2]
Wu, Wei [2]
Yan, Junjie [2]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] SenseTime Grp Ltd, Hong Kong, Peoples R China
[3] SenseTime, Hong Kong, Peoples R China
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.00209
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) by encoding spatiotemporal and motion features together.
Pages: 2000-2009
Number of pages: 10
Related papers
50 in total
  • [31] Spatiotemporal Pyramid Network for Video Action Recognition
    Wang, Yunbo
    Long, Mingsheng
    Wang, Jianmin
    Yu, Philip S.
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2097 - 2106
  • [32] A Spatiotemporal Motion Variation Features Extraction Approach for Human Tracking and Pose-based Action Recognition
    Jalal, Ahmad
    Kamal, Shaharyar
    Farooq, Adnan
    Kim, Daijin
    2015 4TH INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION ICIEV 15, 2015,
  • [33] Spatiotemporal Fusion Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    Zhang, Junxuan
    NEURAL PROCESSING LETTERS, 2019, 50 (02) : 1877 - 1890
  • [34] Fast spatiotemporal MACH filter for action recognition
    Javed Ahmed
    Sadaf Abbasi
    M. Zakir Shaikh
    Machine Vision and Applications, 2013, 24 : 909 - 918
  • [35] Human action recognition based on action relevance weighted encoding
    Yi, Yang
    Li, Ao
    Zhou, Xiaofeng
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 80
  • [36] Attention-Based Temporal Encoding Network with Background-Independent Motion Mask for Action Recognition
    Weng, Zhengkui
    Jin, Zhipeng
    Chen, Shuangxi
    Shen, Quanquan
    Ren, Xiangyang
    Li, Wuzhao
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2021, 2021
  • [37] Hierarchical Dynamic Parsing and Encoding for Action Recognition
    Su, Bing
    Zhou, Jiahuan
    Ding, Xiaoqing
    Wang, Hao
    Wu, Ying
    COMPUTER VISION - ECCV 2016, PT IV, 2016, 9908 : 202 - 217
  • [38] Deep Temporal Feature Encoding for Action Recognition
    Li, Lin
    Zhang, Zhaoxiang
    Huang, Yan
    Wang, Liang
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 1109 - 1114
  • [39] Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
    Tu, Zhigang
    Li, Hongyan
    Zhang, Dejun
    Dauwels, Justin
    Li, Baoxin
    Yuan, Junsong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (06) : 2799 - 2812
  • [40] RRV: A Spatiotemporal Descriptor for Rigid Body Motion Recognition
    Guo, Yao
    Li, Youfu
    Shao, Zhanpeng
    IEEE TRANSACTIONS ON CYBERNETICS, 2018, 48 (05) : 1513 - 1525