Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

Cited by: 27
Authors
Agethen, Sebastian [1]
Hsu, Winston H. [1]
Affiliations
[1] Natl Taiwan Univ, Taipei 10617, Taiwan
Keywords
Kernel; Videos; Task analysis; Convolution; Feature extraction; YouTube; Mathematical model; Computational and artificial intelligence; Neural networks; Feedforward neural networks; Recurrent neural networks; Action recognition; Fusion
DOI
10.1109/TMM.2019.2932564
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Action recognition greatly benefits motion understanding in video analysis. Recurrent networks such as long short-term memory (LSTM) networks are a popular choice for motion-aware sequence learning tasks. Recently, a convolutional extension of LSTM was proposed, in which input-to-hidden and hidden-to-hidden transitions are modeled through convolution with a single kernel. This implies an unavoidable trade-off between effectiveness and efficiency. Herein, we propose a new enhancement to convolutional LSTM networks that supports accommodation of multiple convolutional kernels and layers. This resembles a Network-in-LSTM approach, which mitigates this trade-off. In addition, we propose an attention-based mechanism that is specifically designed for our multi-kernel extension. We evaluated our proposed extensions in a supervised classification setting on the UCF-101 and Sports-1M datasets, and the results show that our enhancements improve accuracy. We also undertook qualitative analysis to reveal the characteristics of our system and the convolutional LSTM baseline.
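As a rough illustration of the idea in the abstract, the sketch below shows one step of a convolutional LSTM in which the usual single-kernel transition is replaced by several parallel kernels of different sizes, whose outputs are then mixed by a 1x1 convolution. It is not the authors' implementation: the function names, the branch sizes (1, 3, 5), the ReLU between the two layers, and the concatenate-then-mix layout are all assumptions chosen for brevity, and the attention mechanism is omitted.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    height, width = x.shape[1:]
    out = np.zeros((c_out, height, width))
    for dy in range(k):
        for dx in range(k):
            window = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], window)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_kernel_convlstm_step(x, h, c, kernels, mix, bias):
    """One recurrent step (hypothetical layout, not taken from the paper).
    x: (C_in, H, W) input frame features; h, c: (C_hid, H, W) states.
    kernels: list of (F, C_in + C_hid, k, k) weights, one per kernel size.
    mix: (4 * C_hid, F * len(kernels), 1, 1) weights producing the gates."""
    z = np.concatenate([x, h], axis=0)  # joint input-to-hidden / hidden-to-hidden
    # Assumption: a ReLU after each branch, giving a small network inside the cell.
    branches = [np.maximum(conv2d_same(z, w), 0.0) for w in kernels]
    feats = np.concatenate(branches, axis=0)
    gates = conv2d_same(feats, mix) + bias[:, None, None]
    i, f, o, g = np.split(gates, 4, axis=0)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Usage: run three random frames through the cell.
rng = np.random.default_rng(0)
c_in, c_hid, feat, size = 3, 4, 5, 8
kernels = [rng.normal(scale=0.1, size=(feat, c_in + c_hid, k, k)) for k in (1, 3, 5)]
mix = rng.normal(scale=0.1, size=(4 * c_hid, feat * len(kernels), 1, 1))
bias = np.zeros(4 * c_hid)
h = np.zeros((c_hid, size, size))
c = np.zeros_like(h)
for _ in range(3):
    x = rng.normal(size=(c_in, size, size))
    h, c = multi_kernel_convlstm_step(x, h, c, kernels, mix, bias)
print(h.shape)
```

Passing a single-element kernels list with an identity-like mix would bring this back close to the single-kernel baseline the abstract describes, which is one way to compare the two settings under otherwise equal conditions.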
Pages: 819-829
Number of pages: 11