Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Cited by: 78
Authors
Zhao, Shichao [1]
Liu, Yanbin [1]
Han, Yahong [1]
Hong, Richang [2]
Hu, Qinghua [1]
Tian, Qi [3]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin 300072, Peoples R China
[2] Hefei Univ Technol, Sch Comp & Informat, Hefei 230000, Anhui, Peoples R China
[3] Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA
Keywords
ConvNets; pooling strategy; video representations; feature fusion; DENSE
DOI
10.1109/TCSVT.2017.2682196
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communications Technology]
Discipline classification codes
0808; 0809
Abstract
Deep ConvNets have shown strong performance on image classification tasks, but problems remain in building deep video representations for action recognition. On the one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capacity to capture complex action information; on the other hand, the temporal information in videos is not properly exploited when pooling and encoding video sequences. To address these issues, in this paper we employ two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet [1]) and the temporal net from Two-Stream ConvNets [2], for action representation. Activations from the convolutional layers and from a proposed new layer, called the frame-diff layer, are extracted and pooled with two temporal pooling strategies: Trajectory pooling and Line pooling. The pooled local descriptors are then encoded with the vector of locally aggregated descriptors (VLAD) [3] to form the video representations. To verify the effectiveness of the proposed framework, we conduct experiments on the UCF101 and HMDB51 data sets. The framework achieves an accuracy of 92.08% on UCF101, which is state-of-the-art, and 65.62% on HMDB51, which is comparable to the state-of-the-art. In addition, the proposed Line pooling strategy speeds up feature extraction while achieving performance comparable to Trajectory pooling.
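Below is a minimal, hypothetical Python/NumPy sketch of the kind of pipeline the abstract describes: per-frame convolutional feature maps are pooled along temporal lines into local descriptors, which are then VLAD-encoded against a set of pre-learned centers. The fixed-position line sampling, the array shapes, the center count, and the normalization steps are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): pool per-frame conv feature maps
# along temporal "lines" and encode the pooled descriptors with VLAD.
import numpy as np

def line_pool(feature_maps, stride=2):
    """Pool conv activations along temporal lines.

    feature_maps: array of shape (T, C, H, W), one conv layer per frame.
    For each spatial cell (sampled with `stride`), the C-dim activations are
    averaged over all T frames into one local descriptor (an assumed variant
    of line pooling). Returns an array of shape (num_lines, C).
    """
    T, C, H, W = feature_maps.shape
    descriptors = []
    for y in range(0, H, stride):
        for x in range(0, W, stride):
            line = feature_maps[:, :, y, x]          # same cell over time: (T, C)
            descriptors.append(line.mean(axis=0))    # temporal average pooling
    return np.stack(descriptors)

def vlad_encode(descriptors, centers):
    """Standard VLAD: accumulate residuals to the nearest center, then apply
    signed square-root and L2 normalization.

    descriptors: (N, C); centers: (K, C), e.g. from k-means. Returns (K * C,).
    """
    K, C = centers.shape
    # Hard-assign each descriptor to its nearest center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    vlad = np.zeros((K, C), dtype=descriptors.dtype)
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))     # power normalization
    flat = vlad.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

# Toy usage: 10 frames of a 512-channel 14x14 conv layer, 64 assumed centers.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maps = rng.standard_normal((10, 512, 14, 14)).astype(np.float32)
    centers = rng.standard_normal((64, 512)).astype(np.float32)
    video_repr = vlad_encode(line_pool(maps), centers)
    print(video_repr.shape)  # (32768,) = 64 centers x 512 channels
```

In this reading, line pooling avoids the cost of tracking point trajectories by sampling the same spatial cell across frames, which is consistent with the abstract's claim that it speeds up feature extraction while remaining comparable to Trajectory pooling.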
Pages: 1839-1849
Number of pages: 11
Related papers (50 records in total)
  • [1] Aggarwal, J. K.; Ryoo, M. S. Human Activity Analysis: A Review. ACM Computing Surveys, 2011, 43(3).
  • [2] [Anonymous]. CoRR, 2015.
  • [3] Bay, Herbert; Ess, Andreas; Tuytelaars, Tinne; Van Gool, Luc. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008, 110(3): 346-359.
  • [4] Cai, Zhuowei; Wang, Limin; Peng, Xiaojiang; Qiao, Yu. Multi-View Super Vector for Action Recognition. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 596-603.
  • [5] Cao, Yang; Wang, Changhu; Li, Zhiwei; Zhang, Liqing; Zhang, Lei. Spatial-Bag-of-Features. 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010: 3352-3359.
  • [6] Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3).
  • [7] Chen, M. Y. Technical Report CMU-CS-09-161, Carnegie Mellon University, 2009.
  • [8] Donahue, J. Proceedings of Machine Learning Research, 2014, Vol. 32.
  • [9] Fischler, M. A.; Bolles, R. C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981, 24(6): 381-395.
  • [10] Han, Yahong; Yang, Yi; Wu, Fei; Hong, Richang. Compact and Discriminative Descriptor Inference Using Multi-Cues. IEEE Transactions on Image Processing, 2015, 24(12): 5114-5126.