LSN: Long-Term Spatio-Temporal Network for Video Recognition

Cited by: 0
Authors
Wang, Zhenwei [1, 2, 3]
Dong, Wei [1, 2, 3]
Zhang, Bingbing [4]
Zhang, Jianxin [1, 2, 3]
Affiliations
[1] Dalian Minzu Univ, Sch Comp Sci & Engn, Dalian, Peoples R China
[2] Dalian Minzu Univ, SEAC Key Lab Big Data Appl Technol, Dalian, Peoples R China
[3] Dalian Minzu Univ, Inst Machine Intelligence & Bio Comp, Dalian, Peoples R China
[4] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian, Peoples R China
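Source
DATA SCIENCE (ICPCSEE 2022), PT I | 2022 / Vol. 1628
Funding
National Natural Science Foundation of China;
Keywords
Video action recognition; High-order RNN; Long-term spatio-temporal; ConvLSTM; HO-ConvLSTM;
DOI
10.1007/978-981-19-5194-7_24
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract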
Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have attracted little attention in current video action recognition applications. This work therefore models the long-term spatio-temporal information of a video with a variant of the RNN, namely the high-order RNN. We propose a novel long-term spatio-temporal network (LSN) for this task, the core of which integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module. The ATS module accumulates several previous hidden states into one temporary state, which enters the standard ConvLSTM and, together with the current input, determines the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, and thus characterizes the long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN model achieves performance competitive with representative models.
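The accumulate-then-recur idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It rests on several assumptions: the ATS accumulation is taken to be a plain element-wise mean (the abstract does not say how states are combined), the ConvLSTM is replaced by an element-wise LSTM cell on small vectors, and the names `accumulate_states`, `lstm_step`, `ho_lstm`, and `DEFAULT_WEIGHTS` are hypothetical.

```python
import math
from collections import deque

# Hypothetical fixed weights per gate: (input weight, hidden weight, bias).
DEFAULT_WEIGHTS = {gate: (0.5, 0.5, 0.0) for gate in ("i", "f", "o", "g")}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def accumulate_states(history):
    """ATS stand-in (assumed rule): element-wise mean of stored hidden states."""
    count = len(history)
    return [sum(values) / count for values in zip(*history)]


def lstm_step(x, h_tmp, c, w):
    """One element-wise LSTM step driven by the temporary state h_tmp.

    A ConvLSTM would replace each scalar multiply with a convolution
    over feature maps; scalars keep the sketch short.
    """
    h_new, c_new = [], []
    for xi, hi, ci in zip(x, h_tmp, c):
        pre = {g: w[g][0] * xi + w[g][1] * hi + w[g][2] for g in w}
        i = sigmoid(pre["i"])      # input gate
        f = sigmoid(pre["f"])      # forget gate
        o = sigmoid(pre["o"])      # output gate
        g = math.tanh(pre["g"])    # candidate cell update
        c_next = f * ci + i * g
        c_new.append(c_next)
        h_new.append(o * math.tanh(c_next))
    return h_new, c_new


def ho_lstm(sequence, order=3, weights=DEFAULT_WEIGHTS):
    """Run the high-order recurrence over a list of equal-length vectors.

    `order` is the number of previous hidden states kept for accumulation;
    order=1 reduces to an ordinary first-order LSTM recurrence.
    """
    size = len(sequence[0])
    history = deque([[0.0] * size], maxlen=order)  # starts with a zero state
    c = [0.0] * size
    outputs = []
    for x in sequence:
        h_tmp = accumulate_states(history)  # several states -> one temporary state
        h, c = lstm_step(x, h_tmp, c, weights)
        history.append(h)                   # the oldest state drops out automatically
        outputs.append(h)
    return outputs
```

Calling `ho_lstm(frames, order=1)` and `ho_lstm(frames, order=3)` on the same input gives identical first outputs and diverging later ones, since only the higher-order variant mixes older hidden states into the recurrence.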
Pages: 326-338
Page count: 13