Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Cited by: 78
Authors
Zhao, Shichao [1]
Liu, Yanbin [1]
Han, Yahong [1]
Hong, Richang [2]
Hu, Qinghua [1]
Tian, Qi [3]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin 300072, Peoples R China
[2] Hefei Univ Technol, Sch Comp & Informat, Hefei 230000, Anhui, Peoples R China
[3] Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA
Keywords
ConvNets; pooling strategy; video representations; feature fusion; DENSE
DOI
10.1109/TCSVT.2017.2682196
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communications Technology]
Discipline classification codes
0808; 0809
Abstract
Deep ConvNets have shown strong performance on image classification tasks, but problems remain in building deep video representations for action recognition. On the one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capacity to capture complex action information; on the other hand, the temporal information in videos is not properly exploited when pooling and encoding video sequences. To address these issues, in this paper we employ two state-of-the-art ConvNets, i.e., the very deep spatial net (VGGNet [1]) and the temporal net from Two-Stream ConvNets [2], for action representation. Activations from the convolutional layers and from a proposed new layer, called the frame-diff layer, are extracted and pooled with two temporal pooling strategies: Trajectory pooling and Line pooling. The pooled local descriptors are then encoded with the vector of locally aggregated descriptors (VLAD) [3] to form the video representations. To verify the effectiveness of the proposed framework, we conduct experiments on the UCF101 and HMDB51 data sets. The framework achieves an accuracy of 92.08% on UCF101, which is state-of-the-art, and 65.62% on HMDB51, which is comparable to the state-of-the-art. In addition, the proposed Line pooling strategy speeds up feature extraction while achieving performance comparable to Trajectory pooling.
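Below is a minimal, hypothetical Python/NumPy sketch of the kind of pipeline the abstract describes: per-frame convolutional feature maps are pooled along temporal lines into local descriptors, which are then VLAD-encoded against a set of pre-learned centers. The fixed-position line sampling, the array shapes, the center count, and the normalization steps are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): pool per-frame conv feature maps
# along temporal "lines" and encode the pooled descriptors with VLAD.
import numpy as np

def line_pool(feature_maps, stride=2):
    """Pool conv activations along temporal lines.

    feature_maps: array of shape (T, C, H, W), one conv layer per frame.
    For each spatial cell (sampled with `stride`), the C-dim activations are
    averaged over all T frames into one local descriptor (an assumed variant
    of line pooling). Returns an array of shape (num_lines, C).
    """
    T, C, H, W = feature_maps.shape
    descriptors = []
    for y in range(0, H, stride):
        for x in range(0, W, stride):
            line = feature_maps[:, :, y, x]          # same cell over time: (T, C)
            descriptors.append(line.mean(axis=0))    # temporal average pooling
    return np.stack(descriptors)

def vlad_encode(descriptors, centers):
    """Standard VLAD: accumulate residuals to the nearest center, then apply
    signed square-root and L2 normalization.

    descriptors: (N, C); centers: (K, C), e.g. from k-means. Returns (K * C,).
    """
    K, C = centers.shape
    # Hard-assign each descriptor to its nearest center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    vlad = np.zeros((K, C), dtype=descriptors.dtype)
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))     # power normalization
    flat = vlad.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

# Toy usage: 10 frames of a 512-channel 14x14 conv layer, 64 assumed centers.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maps = rng.standard_normal((10, 512, 14, 14)).astype(np.float32)
    centers = rng.standard_normal((64, 512)).astype(np.float32)
    video_repr = vlad_encode(line_pool(maps), centers)
    print(video_repr.shape)  # (32768,) = 64 centers x 512 channels
```

In this reading, line pooling avoids the cost of tracking point trajectories by sampling the same spatial cell across frames, which is consistent with the abstract's claim that it speeds up feature extraction while remaining comparable to Trajectory pooling.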
Pages: 1839-1849
Number of pages: 11
Related papers (50 records in total)
  • [1] Aggarwal, J. K.; Ryoo, M. S. Human Activity Analysis: A Review. ACM Computing Surveys, 2011, 43(3).
  • [2] [Anonymous]. CoRR, 2015.
  • [3] Bay, Herbert; Ess, Andreas; Tuytelaars, Tinne; Van Gool, Luc. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008, 110(3): 346-359.
  • [4] Cai, Zhuowei; Wang, Limin; Peng, Xiaojiang; Qiao, Yu. Multi-View Super Vector for Action Recognition. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 596-603.
  • [5] Cao, Yang; Wang, Changhu; Li, Zhiwei; Zhang, Liqing; Zhang, Lei. Spatial-Bag-of-Features. 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010: 3352-3359.
  • [6] Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3).
  • [7] Chen, M. Y. Technical Report CMU-CS-09-161, Carnegie Mellon University, 2009.
  • [8] Donahue, J. Proceedings of Machine Learning Research, 2014, Vol. 32.
  • [9] Fischler, M. A.; Bolles, R. C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981, 24(6): 381-395.
  • [10] Han, Yahong; Yang, Yi; Wu, Fei; Hong, Richang. Compact and Discriminative Descriptor Inference Using Multi-Cues. IEEE Transactions on Image Processing, 2015, 24(12): 5114-5126.