Action Recognition with Dynamic Image Networks

Cited by: 149
Authors
Bilen, Hakan [1]
Fernando, Basura [2]
Gavves, Efstratios [3]
Vedaldi, Andrea [4]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh EH8 9YL, Midlothian, Scotland
[2] Australian Natl Univ, ACRV, Res Sch Engn, Canberra, ACT 2601, Australia
[3] Univ Amsterdam, QUVA Lab, NL-1012 WX Amsterdam, Netherlands
[4] Univ Oxford, VGG, Oxford OX1 2JD, England
Funding
Engineering and Physical Sciences Research Council (UK); Australian Research Council;
Keywords
Human action classification; video classification; motion representation; deep learning; convolutional neural networks; HISTOGRAMS;
DOI
10.1109/TPAMI.2017.2769085
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the concept of the dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of 'rank pooling'. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of that machine as a representation. We call the resulting representation a dynamic image because it summarizes the video dynamics in addition to appearance. This powerful idea makes it possible to convert any video to an image so that existing CNN models pre-trained on still images can be immediately extended to videos. We also present an efficient approximate rank pooling operator that runs two orders of magnitude faster than the standard ones without any loss in ranking performance and can be formulated as a CNN layer. To demonstrate the power of the representation, we introduce a novel four-stream CNN architecture which can learn from RGB and optical flow frames as well as from their dynamic image representations. We show that the proposed network achieves state-of-the-art performance, 95.5 and 72.5 percent accuracy, on the UCF101 and HMDB51 datasets, respectively.
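To make the rank pooling idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the simplest possible approximation: take one gradient step, starting from zero, on a pairwise ranking hinge loss that asks later frames to score higher than earlier ones. Under that assumption the pooled representation is a fixed weighted sum of the frames with weights alpha_t = 2t - T - 1. The published operator may differ in details (for example, by smoothing the frames before ranking). The function names `rank_pool_weights`, `approx_rank_pool`, and `ranking_scores` are hypothetical.

```python
from typing import List, Sequence


def rank_pool_weights(num_frames: int) -> List[int]:
    """Weights alpha_t = 2t - T - 1 for t = 1..T (assumed one-step approximation).

    Frame t appears (t - 1) times as the later frame of a pair and
    (T - t) times as the earlier frame, hence (t - 1) - (T - t).
    """
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


def approx_rank_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a video (a list of flattened frames) into one 'dynamic image'."""
    weights = rank_pool_weights(len(frames))
    pooled = [0.0] * len(frames[0])
    for weight, frame in zip(weights, frames):
        for i, value in enumerate(frame):
            pooled[i] += weight * value
    return pooled


def ranking_scores(frames: Sequence[Sequence[float]], d: Sequence[float]) -> List[float]:
    """Score each frame by its inner product with the pooled vector d."""
    return [sum(a * b for a, b in zip(frame, d)) for frame in frames]


# Toy video: pixel 0 brightens over time, pixel 1 stays constant.
video = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
dynamic_image = approx_rank_pool(video)
scores = ranking_scores(video, dynamic_image)
print(dynamic_image)  # [10.0, 0.0]
print(scores)         # [0.0, 10.0, 20.0, 30.0]
```

In this toy case the static pixel cancels out because the weights sum to zero, while the changing pixel survives. The resulting vector also orders the frames correctly in time, which is the property a ranking machine is trained for.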
Pages: 2799-2813
Page count: 15
Related papers
78 in total
[1]   Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning [J].
Ali, Saad ;
Shah, Mubarak .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2010, 32 (02) :288-303
[2]  
[Anonymous], 2016, Advances in Neural Information Processing Systems
[3]  
BILEN H, 2016, PROC CVPR IEEE, P3034, DOI 10.1109/CVPR.2016.331
[4]   The recognition of human movement using temporal templates [J].
Bobick, AF ;
Davis, JW .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2001, 23 (03) :257-267
[5]   Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J].
Carreira, Joao ;
Zisserman, Andrew .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :4724-4733
[6]   P-CNN: Pose-based CNN Features for Action Recognition [J].
Cheron, Guilhem ;
Laptev, Ivan ;
Schmid, Cordelia .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :3218-3226
[7]   Learning a similarity metric discriminatively, with application to face verification [J].
Chopra, S ;
Hadsell, R ;
LeCun, Y .
2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, :539-546
[8]  
Courtney PG, 2015, IEEE COMP SEMICON
[9]   Histograms of oriented gradients for human detection [J].
Dalal, N ;
Triggs, B .
2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, :886-893
[10]   Human detection using oriented histograms of flow and appearance [J].
Dalal, Navneet ;
Triggs, Bill ;
Schmid, Cordelia .
COMPUTER VISION - ECCV 2006, PT 2, PROCEEDINGS, 2006, 3952 :428-441