LSN: Long-Term Spatio-Temporal Network for Video Recognition

Cited by: 0
Authors
Wang, Zhenwei [1, 2, 3]
Dong, Wei [1, 2, 3]
Zhang, Bingbing [4]
Zhang, Jianxin [1, 2, 3]
Affiliations
[1] Dalian Minzu Univ, Sch Comp Sci & Engn, Dalian, Peoples R China
[2] Dalian Minzu Univ, SEAC Key Lab Big Data Appl Technol, Dalian, Peoples R China
[3] Dalian Minzu Univ, Inst Machine Intelligence & Bio Comp, Dalian, Peoples R China
[4] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian, Peoples R China
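Source
DATA SCIENCE (ICPCSEE 2022), PT I | 2022 / Vol. 1628
Funding
National Natural Science Foundation of China
Keywords
Video action recognition; High-order RNN; Long-term spatio-temporal; ConvLSTM; HO-ConvLSTM
DOI
10.1007/978-981-19-5194-7_24
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract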
Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have attracted little attention in current video action recognition applications. This work therefore models the long-term spatio-temporal information of video with a variant of the RNN, namely the high-order RNN. We propose a novel long-term spatio-temporal network (LSN) for this task, the core of which integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module. The ATS module accumulates several previous hidden states into one temporary state, which enters the standard ConvLSTM and, together with the current input, determines the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, and so characterizes the long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN model achieves performance competitive with representative models.
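The data flow described in the abstract (several previous hidden states accumulated into one temporary state, which then replaces the usual previous hidden state in a standard LSTM-style update) can be illustrated with a toy sketch. This is not the authors' implementation: it works on scalars rather than convolutional feature maps, the accumulation rule (a plain mean over the last `order` hidden states) is an assumption because the abstract does not say how states are combined, and the names `accumulate_states`, `lstm_step`, `run_ho_lstm`, and `DEFAULT_WEIGHTS` are hypothetical.

```python
import math
from collections import deque

# Hypothetical fixed weights per gate: (input weight, hidden weight, bias).
DEFAULT_WEIGHTS = {
    "i": (0.5, 0.5, 0.0),  # input gate
    "f": (0.5, 0.5, 1.0),  # forget gate
    "o": (0.5, 0.5, 0.0),  # output gate
    "g": (1.0, 1.0, 0.0),  # candidate cell value
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def accumulate_states(history):
    """Stand-in for the ATS module: average the stored hidden states.

    The averaging rule is an assumption made for illustration only.
    """
    return sum(history) / len(history)


def lstm_step(x, h_tmp, c_prev, weights):
    """One scalar LSTM update where h_tmp takes the place of h_{t-1}."""
    def pre(gate):
        wx, wh, b = weights[gate]
        return wx * x + wh * h_tmp + b

    i = sigmoid(pre("i"))
    f = sigmoid(pre("f"))
    o = sigmoid(pre("o"))
    g = math.tanh(pre("g"))
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_ho_lstm(inputs, order=3, weights=DEFAULT_WEIGHTS):
    """Run the toy high-order recurrence over a sequence of scalar inputs."""
    # Keeps at most `order` recent hidden states; starts from a zero state.
    history = deque([0.0], maxlen=order)
    c = 0.0
    outputs = []
    for x in inputs:
        h_tmp = accumulate_states(history)  # temporary state from past states
        h, c = lstm_step(x, h_tmp, c, weights)
        history.append(h)
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    print(run_ho_lstm([1.0, 0.5, -0.3, 0.8], order=3))
```

With `order=1` the sketch reduces to an ordinary first-order recurrence, since the temporary state is then just the previous hidden state. In a convolutional version the scalar products would be replaced by convolutions over feature maps.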
Pages: 326-338
Page count: 13