Second-order Temporal Pooling for Action Recognition

Cited by: 21

Authors:

Cherian, Anoop [1]

Gould, Stephen [1]

Affiliations:

[1] Australian Natl Univ, Australian Ctr Robot Vis, Canberra, ACT, Australia

Source:

INTERNATIONAL JOURNAL OF COMPUTER VISION | 2019, Volume 127, Issue 4

Funding:

Australian Research Council;

Keywords:

Action recognition; Deep Learning; Kernel descriptors; Second-order statistics; Pooling; Image Representations; End-to-end learning; Region covariance descriptors;

DOI:

10.1007/s11263-018-1111-5

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, which generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
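To make the contrast between pooling orders concrete, the following is a minimal sketch, not the authors' implementation. It assumes each clip is already summarized by a small feature vector and uses plain Pearson correlation between per-dimension feature trajectories as the second-order statistic; the abstract does not state the exact normalization, and the end-to-end learning and kernel (RKHS) extensions are not shown. The function names and the toy `clips` data are hypothetical.

```python
import math


def max_pool(clips):
    """Zeroth-order pooling: per-dimension maximum over all clips."""
    return [max(col) for col in zip(*clips)]


def average_pool(clips):
    """First-order pooling: per-dimension mean over all clips."""
    return [sum(col) / len(col) for col in zip(*clips)]


def temporal_correlation_pool(clips, eps=1e-12):
    """Second-order pooling (illustrative assumption): a d x d matrix whose
    (i, j) entry is the Pearson correlation between the temporal trajectories
    of feature dimensions i and j across the clips of one video."""
    # Transpose T clips x d features into d trajectories of length T.
    trajectories = [list(col) for col in zip(*clips)]
    centered = []
    for traj in trajectories:
        mean = sum(traj) / len(traj)
        centered.append([v - mean for v in traj])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    d = len(centered)
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            # eps guards against division by zero for constant trajectories.
            corr[i][j] = dot / (norms[i] * norms[j] + eps)
    return corr


# Toy video: 4 clips, 3 feature dimensions (made-up values).
clips = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 1.0],
]
descriptor = temporal_correlation_pool(clips)
```

In this toy input, dimensions 0 and 1 rise together while dimension 2 falls, so the correlation matrix records which features co-activate over time, information that the max and average vectors discard.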


Pages: 340-362

Page count: 23
