Constructing Hierarchical Spatiotemporal Information for Action Recognition

Times cited: 0

Authors:

Yao, Guangle [1,2,3]

Zhong, Jiandan [1,2,3]

Lei, Tao [1]

Liu, Xianyuan [1]

Affiliations:

[1] Chinese Acad Sci, Inst Opt & Elect, Chengdu, Sichuan, Peoples R China

[2] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China

[3] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

2018 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION (SMARTWORLD/SCALCOM/UIC/ATC/CBDCOM/IOP/SCI) | 2018

Keywords:

action recognition; convolutional neural network; spatiotemporal information; action representation; optical flow; NETWORKS;

DOI:

10.1109/SmartWorld.2018.00123

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Video action recognition is widely applied in video indexing, intelligent surveillance, multimedia understanding, and other fields. Recently, it has been greatly improved by incorporating convolutional neural networks (ConvNets). The features of shallow layers in a ConvNet tend to model the appearance and motion of actions, while the features of deep layers tend to represent the actions themselves. In this paper, we propose to construct hierarchical information by combining the spatiotemporal features of shallow and deep layers in a 3D ConvNet for action recognition. Specifically, we use Res3D to extract spatiotemporal information from different types of layers, and transfer the knowledge learned from RGB to the optical flow field. We also propose Parallel Pair Discriminant Correlation Analysis (PPDCA) to fuse the spatiotemporal information of multiple layers into a compact hierarchical action representation. The experimental results show that the proposed hierarchical spatiotemporal information achieves a good balance between accuracy and dimensionality, and that our method not only outperforms the single-layer Res3D methods but also achieves recognition performance comparable to that of state-of-the-art methods.

Pages: 596-602

Page count: 7
