LSN: Long-Term Spatio-Temporal Network for Video Recognition

Cited by: 0
Authors
Wang, Zhenwei [1, 2, 3]
Dong, Wei [1, 2, 3]
Zhang, Bingbing [4]
Zhang, Jianxin [1, 2, 3]
Affiliations
[1] Dalian Minzu Univ, Sch Comp Sci & Engn, Dalian, Peoples R China
[2] Dalian Minzu Univ, SEAC Key Lab Big Data Appl Technol, Dalian, Peoples R China
[3] Dalian Minzu Univ, Inst Machine Intelligence & Bio Comp, Dalian, Peoples R China
[4] Dalian Univ Technol, Sch Informat & Commun Engn, Dalian, Peoples R China
Source
DATA SCIENCE (ICPCSEE 2022), PT I | 2022 / Vol. 1628
Funding
National Natural Science Foundation of China
Keywords
Video action recognition; High-order RNN; Long-term spatio-temporal; ConvLSTM; HO-ConvLSTM
DOI
10.1007/978-981-19-5194-7_24
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have received little attention in current video action recognition applications. This work therefore models the long-term spatio-temporal information of video with a variant of the RNN, namely the high-order RNN. We propose a novel long-term spatio-temporal network (LSN) for this task, the core of which integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module. In the ATS module, several previous hidden states are accumulated into one temporary state, which enters the standard ConvLSTM and, together with the current input, determines the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, and thus characterizes the long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN model achieves performance competitive with representative models.
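The high-order recurrence described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the convolutions of a ConvLSTM are replaced here by scalar weights, and the accumulation rule (a plain average of the last `order` hidden states) is an assumption, since the abstract does not specify how the ATS module combines them. The names `lstm_step`, `high_order_run`, and `order` are hypothetical.

```python
import math
from collections import deque


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One scalar LSTM step. `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, act):
        wx, wh, b = w[name]
        return act(wx * x + wh * h + b)

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell update
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def high_order_run(xs, w, order=3):
    """Run a scalar LSTM whose hidden input is an accumulated temporary state."""
    # Buffer holding the last `order` hidden states, initialised to zero.
    history = deque([0.0] * order, maxlen=order)
    c = 0.0
    outputs = []
    for x in xs:
        # Assumed accumulation rule: average of the stored hidden states.
        h_tmp = sum(history) / order
        h, c = lstm_step(x, h_tmp, c, w)
        history.append(h)
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    weights = {name: (0.5, 0.5, 0.0) for name in "ifog"}
    print(high_order_run([1.0, -0.5, 0.25, 0.0], weights, order=3))
```

With `order=1` the temporary state is just the previous hidden state, so the sketch reduces to an ordinary first-order LSTM recurrence; larger values let each step see a longer window of past states.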
Pages: 326-338
Number of pages: 13