Recurrent Spatiotemporal Feature Learning for Action Recognition

Cited by: 5
Authors
Chen, Ze [1 ]
Lu, Hongtao [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligence Intera, Dept Comp Sci & Engn, Shanghai, Peoples R China
Source
ICRAI 2018: PROCEEDINGS OF 2018 4TH INTERNATIONAL CONFERENCE ON ROBOTICS AND ARTIFICIAL INTELLIGENCE | 2018
Keywords
Action recognition; video analysis; convolutional LSTM; residual network
DOI
10.1145/3297097.3297107
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) have shown excellent performance on a variety of sequence learning problems in language and speech processing. Previous works that leverage RNNs for action recognition mainly apply LSTM on top of Convolutional Neural Networks (CNNs), feeding high-level semantic features to the RNN and neglecting to learn spatiotemporal features from the video; this prevents the models from capturing complex action patterns and leads to inferior performance. In this work, instead of adopting RNNs as classifiers, we propose to learn spatiotemporal features in a recurrent way by replacing intermediate convolutional layers in CNNs with recurrent layers that model spatiotemporal information. To learn discriminative spatiotemporal features, we extend the bottleneck structure of the Residual Network (ResNet) to model the spatiotemporal information in action videos. To fully utilize pretrained CNN models, we also introduce approaches to transfer the weights of the original convolutional layers to our proposed model. Our proposed architecture is end-to-end trainable and flexible enough to be adapted to any CNN-based structure. Our model achieves state-of-the-art performance among RNN-based approaches on two standard benchmarks for action recognition.
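The core idea of the abstract, replacing the middle convolution of a ResNet bottleneck with a convolutional LSTM so the block itself learns spatiotemporal features, can be sketched roughly as follows. This is a minimal PyTorch-style sketch, not the authors' implementation: the names ConvLSTMCell and RecurrentBottleneck, the channel sizes, and the zero-initialized recurrent states are illustrative assumptions, and the weight-transfer step from the pretrained CNN mentioned in the abstract is omitted.

# A minimal sketch (not the authors' code) of a ResNet-style bottleneck whose
# 3x3 convolution is replaced by a convolutional LSTM over video frames.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all gates are computed with 2-D convolutions."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentBottleneck(nn.Module):
    """Bottleneck block whose middle 3x3 convolution is a ConvLSTM, so the
    block models spatiotemporal structure across the frames of a clip."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.mid_ch = mid_ch
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)        # 1x1 channel reduction
        self.convlstm = ConvLSTMCell(mid_ch, mid_ch, 3)  # recurrent 3x3 stage
        self.expand = nn.Conv2d(mid_ch, in_ch, 1)        # 1x1 channel expansion
        self.relu = nn.ReLU(inplace=True)

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, _, hgt, wid = clip.shape
        state = (clip.new_zeros(b, self.mid_ch, hgt, wid),
                 clip.new_zeros(b, self.mid_ch, hgt, wid))
        outs = []
        for step in range(t):
            x = clip[:, step]
            hidden, cell = self.convlstm(self.relu(self.reduce(x)), state)
            state = (hidden, cell)
            outs.append(self.relu(x + self.expand(hidden)))  # residual connection
        return torch.stack(outs, dim=1)

# Usage: an 8-frame clip of 64-channel feature maps from an earlier CNN stage.
block = RecurrentBottleneck(in_ch=64, mid_ch=16)
out = block(torch.randn(2, 8, 64, 56, 56))
print(out.shape)  # torch.Size([2, 8, 64, 56, 56])

Because the recurrent layer sits inside the feature extractor rather than on top of it, the block can be dropped into an existing CNN stage and the whole network remains end-to-end trainable, which matches the flexibility claim in the abstract.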
Pages: 12-17
Number of pages: 6