Recurrent Spatiotemporal Feature Learning for Action Recognition

Cited by: 5
Authors
Chen, Ze [1]
Lu, Hongtao [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligence Intera, Dept Comp Sci & Engn, Shanghai, Peoples R China
Source
ICRAI 2018: PROCEEDINGS OF 2018 4TH INTERNATIONAL CONFERENCE ON ROBOTICS AND ARTIFICIAL INTELLIGENCE | 2018
Keywords
Action recognition; video analysis; convolutional LSTM; residual network
DOI
10.1145/3297097.3297107
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) have shown excellent performance on a variety of sequence learning problems in language and speech processing. Previous works leveraging RNNs for action recognition mainly apply LSTM on top of Convolutional Neural Networks (CNNs), feeding high-level semantic features to the RNNs and neglecting to learn spatiotemporal features in video, which leaves the models unable to capture complex action patterns and leads to inferior performance. In this work, instead of adopting RNNs as classifiers, we propose to learn spatiotemporal features in a recurrent way by replacing intermediate convolutional layers in CNNs with recurrent layers that model spatiotemporal information. In order to learn discriminative spatiotemporal features, we extend the bottleneck structure of the Residual Network (ResNet) to model the spatiotemporal information in action videos. To fully utilize the pretrained CNN model, we also introduce approaches to transfer the weights of the original convolutional layers to our proposed model. Our proposed architecture is end-to-end trainable and flexible enough to be adapted to any CNN-based structure. Among RNN-based approaches, our model achieves state-of-the-art performance on two standard benchmarks for action recognition.
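
The idea of replacing a convolutional layer with a recurrent one can be illustrated with a convolutional LSTM cell. The sketch below is a minimal, single-channel, pure-Python toy and is not the authors' implementation: it uses a common peephole-free ConvLSTM formulation, and the names `conv2d_same`, `ConvLSTMCell`, and `run_clip` are hypothetical. The abstract does not say how the weights are transferred, so the initialization here (pretrained kernel copied into every input-to-state kernel, hidden-to-state kernels set to zero) is purely an assumption made for illustration.

```python
import math


def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding ('same' size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, z = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= z < w:
                        total += x[y][z] * k[a][b]
            out[i][j] = total
    return out


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class ConvLSTMCell:
    """Toy ConvLSTM cell with gates i, f, o and candidate g (no peepholes)."""

    def __init__(self, pretrained_kernel):
        kh, kw = len(pretrained_kernel), len(pretrained_kernel[0])
        # ASSUMPTION: the transfer scheme is illustrative only. Each
        # input-to-state kernel starts as a copy of the pretrained kernel.
        self.w_x = {g: [row[:] for row in pretrained_kernel] for g in "ifog"}
        # ASSUMPTION: hidden-to-state kernels start at zero.
        self.w_h = {g: [[0.0] * kw for _ in range(kh)] for g in "ifog"}

    def step(self, x, h, c):
        """Consume one frame x with previous state (h, c); return new (h, c)."""
        pre = {}
        for g in "ifog":
            ax = conv2d_same(x, self.w_x[g])
            ah = conv2d_same(h, self.w_h[g])
            pre[g] = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ax, ah)]
        rows, cols = len(x), len(x[0])
        new_h = [[0.0] * cols for _ in range(rows)]
        new_c = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for s in range(cols):
                i_gate = _sigmoid(pre["i"][r][s])
                f_gate = _sigmoid(pre["f"][r][s])
                o_gate = _sigmoid(pre["o"][r][s])
                cand = math.tanh(pre["g"][r][s])
                new_c[r][s] = f_gate * c[r][s] + i_gate * cand
                new_h[r][s] = o_gate * math.tanh(new_c[r][s])
        return new_h, new_c


def run_clip(cell, frames):
    """Run the cell over a list of frames and return the final hidden map."""
    rows, cols = len(frames[0]), len(frames[0][0])
    h = [[0.0] * cols for _ in range(rows)]
    c = [[0.0] * cols for _ in range(rows)]
    for frame in frames:
        h, c = cell.step(frame, h, c)
    return h
```

Because the hidden state keeps the spatial layout of the input frame, such a cell can sit where a convolutional layer used to be, and later layers still receive a feature map of the same shape. A real model would use multi-channel tensors and learned parameters in a deep learning framework.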
Pages: 12-17
Number of pages: 6