Second-order Temporal Pooling for Action Recognition

Cited by: 21
Authors
Cherian, Anoop [1]
Gould, Stephen [1]
Affiliations
[1] Australian National University, Australian Centre for Robotic Vision, Canberra, ACT, Australia
Funding
Australian Research Council
Keywords
Action recognition; Deep learning; Kernel descriptors; Second-order statistics; Pooling; Image representations; End-to-end learning; Region covariance descriptors
DOI
10.1007/s11263-018-1111-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
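
To make the contrast between first-order and second-order aggregation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the clip-level CNN features are already stacked in an array of shape (n_clips, d), and it assumes that "similarity between temporal evolutions" can be approximated by the Pearson correlation between each pair of feature trajectories. The function names (`average_pooling`, `temporal_correlation_pooling`, `upper_triangle_vector`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def average_pooling(clip_features):
    """First-order baseline: mean of the clip-level features over time."""
    return np.asarray(clip_features, dtype=np.float64).mean(axis=0)


def temporal_correlation_pooling(clip_features, eps=1e-8):
    """Second-order sketch (assumption: Pearson correlation as similarity).

    clip_features: array of shape (n_clips, d), one row per clip.
    Returns a symmetric (d, d) matrix whose (i, j) entry is the correlation
    between the temporal trajectories of feature i and feature j.
    """
    X = np.asarray(clip_features, dtype=np.float64)
    if X.ndim != 2 or X.shape[0] < 2:
        raise ValueError("expected a (n_clips, d) array with n_clips >= 2")
    # Centre every feature trajectory along the temporal axis.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Scale every trajectory to unit length; eps guards constant trajectories.
    norms = np.linalg.norm(Xc, axis=0, keepdims=True)
    Xn = Xc / np.maximum(norms, eps)
    # Inner products of unit-length centred trajectories are correlations.
    return Xn.T @ Xn


def upper_triangle_vector(matrix):
    """Flatten a symmetric matrix to its upper triangle (diagonal included)."""
    rows, cols = np.triu_indices(matrix.shape[0])
    return matrix[rows, cols]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((10, 4))  # 10 clips, 4-dim toy features
    descriptor = temporal_correlation_pooling(features)
    print(average_pooling(features).shape)          # first-order: d values
    print(upper_triangle_vector(descriptor).shape)  # second-order: d(d+1)/2 values
```

In this sketch the first-order descriptor has d entries, while the second-order descriptor has d(d+1)/2 distinct entries, one for each pair of features that may co-activate over time. The kernelized higher-order extensions mentioned in the abstract are not covered here.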
Pages: 340-362
Number of pages: 23