LSN: Long-Term Spatio-Temporal Network for Video Recognition

Citations: 0
Authors
Wang, Zhenwei [1 ,2 ,3 ]
Dong, Wei [1 ,2 ,3 ]
Zhang, Bingbing [4 ]
Zhang, Jianxin [1 ,2 ,3 ]
Affiliations
[1] Dalian Minzu Univ, Sch Comp Sci & Engn, Dalian, Peoples R China
[2] Dalian Minzu Univ, SEAC Key Lab Big Data Appl Technol, Dalian, Peoples R China
[3] Dalian Minzu Univ, Inst Machine Intelligence & Bio Comp, Dalian, Peoples R China
[4] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian, Peoples R China
Source
DATA SCIENCE (ICPCSEE 2022), PT I | 2022 / Vol. 1628
Funding
National Natural Science Foundation of China;
Keywords
Video action recognition; High-order RNN; Long-term spatio-temporal; ConvLSTM; HO-ConvLSTM;
DOI
10.1007/978-981-19-5194-7_24
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have received little attention in current video action recognition applications. This work therefore attempts to model the long-term spatio-temporal information of a video with a variant of the RNN, i.e., the high-order RNN. We propose a novel long-term spatio-temporal network (LSN) for this video task, the core of which integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module: the ATS module accumulates several previous hidden states into one temporary state, which then enters the standard ConvLSTM and, together with the current input, determines the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, and can thus characterize the long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN model achieves performance competitive with representative models.
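The high-order update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the previous hidden states are accumulated, so a uniform weighted sum is assumed here, and per-element scalar weights stand in for the convolutions of a real ConvLSTM. The names `accumulate_states`, `lstm_step`, `run_high_order`, and `PARAMS` are hypothetical.

```python
import math
from collections import deque


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def accumulate_states(history, weights):
    """ATS step (assumed form): weighted sum of the previous hidden states."""
    size = len(history[0])
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(size)]


def lstm_step(x, h_tmp, c_prev, p):
    """Standard LSTM gating applied elementwise.

    Assumption: scalar weights replace the convolution kernels of a ConvLSTM.
    """
    h_new, c_new = [], []
    for xi, hi, ci in zip(x, h_tmp, c_prev):
        i_gate = sigmoid(p["wi"] * xi + p["ui"] * hi + p["bi"])
        f_gate = sigmoid(p["wf"] * xi + p["uf"] * hi + p["bf"])
        o_gate = sigmoid(p["wo"] * xi + p["uo"] * hi + p["bo"])
        cand = math.tanh(p["wg"] * xi + p["ug"] * hi + p["bg"])
        c = f_gate * ci + i_gate * cand
        c_new.append(c)
        h_new.append(o_gate * math.tanh(c))
    return h_new, c_new


def run_high_order(frames, order, p):
    """Run the recurrence over a list of frames (each a flat list of floats)."""
    size = len(frames[0])
    # Keep the last `order` hidden states; start from zeros.
    history = deque([[0.0] * size for _ in range(order)], maxlen=order)
    weights = [1.0 / order] * order  # assumption: uniform accumulation
    c = [0.0] * size
    outputs = []
    for x in frames:
        h_tmp = accumulate_states(list(history), weights)  # temporary state
        h, c = lstm_step(x, h_tmp, c, p)  # standard cell sees h_tmp, not h_{t-1}
        history.append(h)
        outputs.append(h)
    return outputs


# Hypothetical parameter values, chosen only to make the sketch runnable.
PARAMS = {
    "wi": 0.5, "ui": 0.3, "bi": 0.0,
    "wf": 0.4, "uf": 0.2, "bf": 1.0,
    "wo": 0.6, "uo": 0.1, "bo": 0.0,
    "wg": 0.7, "ug": 0.5, "bg": 0.0,
}

if __name__ == "__main__":
    frames = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.9], [0.0, 0.2]]
    for h in run_high_order(frames, order=3, p=PARAMS):
        print(h)
```

With `order=1` the temporary state is just the previous hidden state, so the sketch reduces to an ordinary first-order LSTM recurrence; larger orders let each step see a summary of several earlier states.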
Pages: 326 - 338
Page count: 13