Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos

Cited by: 162
Authors
Du, Wenbin [1 ,2 ]
Wang, Yali [2 ]
Qiao, Yu [2 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Provincial Key Lab Comp Vis & Virtual R, Shenzhen 518000, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; RSTAN; spatial-temporal attention; attention-driven fusion; actor-attention regularization;
DOI
10.1109/TIP.2017.2778563
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent years have witnessed the popularity of recurrent neural networks (RNNs) for action recognition in videos. However, videos are high-dimensional and contain rich human dynamics at various motion scales, which makes it difficult for traditional RNNs to capture complex action information. In this paper, we propose a novel recurrent spatial-temporal attention network (RSTAN) to address this challenge, introducing a spatial-temporal attention mechanism that adaptively identifies key features from the global video context for every time-step prediction of the RNN. More specifically, we make three main contributions. First, we reinforce the classical long short-term memory (LSTM) with a novel spatial-temporal attention module. At each time step, this module automatically learns a spatial-temporal action representation from all sampled video frames that is compact and highly relevant to the prediction at the current step. Second, we design an attention-driven appearance-motion fusion strategy that integrates the appearance and motion LSTMs into a unified framework, where the LSTMs and their spatial-temporal attention modules in the two streams can be jointly trained in an end-to-end fashion. Third, we develop an actor-attention regularization for RSTAN, which guides the attention mechanism to focus on the important action regions around actors. We evaluate the proposed RSTAN on the benchmark UCF101, HMDB51, and JHMDB data sets. The experimental results show that RSTAN outperforms other recent RNN-based approaches on UCF101 and HMDB51 and achieves the state of the art on JHMDB.
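To make the first contribution concrete, the sketch below illustrates one soft spatial-temporal attention read of the kind the abstract describes: given the LSTM's previous hidden state, additive attention scores are computed jointly over all sampled frames and spatial regions, and the softmax-weighted feature sum becomes the context for the current prediction step. This is a minimal NumPy illustration of the general idea only; the parameter names (`W_h`, `W_f`, `w`) and the additive-tanh scoring form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatial_temporal_attention(h_prev, feats, W_h, W_f, w):
    """One attention read over all sampled frames and regions.

    h_prev : (H,)      previous LSTM hidden state
    feats  : (K, R, D) CNN features for K sampled frames x R spatial regions
    W_h    : (A, H)    hidden-state projection (illustrative parameter)
    W_f    : (A, D)    local-feature projection (illustrative parameter)
    w      : (A,)      scoring vector (illustrative parameter)

    Returns a (D,) context vector: a compact representation of the whole
    video that is weighted toward locations relevant to the current step.
    """
    K, R, D = feats.shape
    flat = feats.reshape(K * R, D)           # score frame x region jointly
    scores = flat @ W_f.T + h_prev @ W_h.T   # (K*R, A), additive attention
    scores = np.tanh(scores) @ w             # (K*R,) scalar score per location
    alpha = softmax(scores)                  # attention over all locations
    return alpha @ flat                      # (D,) weighted feature sum

# Toy usage: 8 frames, 7x7=49 regions, 16-d features, 32-d hidden state.
rng = np.random.default_rng(0)
K, R, D, H, A = 8, 49, 16, 32, 24
ctx = spatial_temporal_attention(
    rng.standard_normal(H), rng.standard_normal((K, R, D)),
    rng.standard_normal((A, H)), rng.standard_normal((A, D)),
    rng.standard_normal(A))
```

Because the softmax runs over all K*R frame-region pairs at once, the read is spatial and temporal simultaneously, rather than attending within a single frame; in the two-stream setting this read would be performed per stream before fusion.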
Pages: 1347-1360 (14 pages)
Cited References
72 in total
[41] Hochreiter S., 1997, NEURAL COMPUT, V9, P1735, DOI 10.1162/neco.1997.9.8.1735
[42] Jhuang H.; Gall J.; Zuffi S.; Schmid C.; Black M. J. Towards understanding action recognition. 2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2013: 3192-3199
[43] Karpathy A.; Toderici G.; Shetty S.; Leung T.; Sukthankar R.; Fei-Fei L. Large-scale video classification with convolutional neural networks. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014: 1725-1732
[44] Krizhevsky A., 2017, COMMUN ACM, V60, P84, DOI 10.1145/3065386
[45] Kuehne H., 2011, IEEE I CONF COMP VIS, P2556, DOI 10.1109/ICCV.2011.6126543
[46] Lan Z., 2017, Deep Local Video Feature for Action Recognition
[47] Laptev I. On space-time interest points. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2005, 64(2-3): 107-123
[48] Ma C.-Y., 2017, TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
[49] Peng X.; Schmid C. Multi-region two-stream R-CNN for action detection. COMPUTER VISION - ECCV 2016, PT IV, 2016, 9908: 744-759
[50] Peng X. J., 2014, LECT NOTES COMPUT SC, V8693, P581, DOI 10.1007/978-3-319-10602-1_38