VLAD3: Encoding Dynamics of Deep Features for Action Recognition

Cited by: 59

Authors:

Li, Yingwei [1]

Li, Weixin [1]

Mahadevan, Vijay

Vasconcelos, Nuno [1]

Affiliations:

[1] Univ Calif San Diego, San Diego, CA 92103 USA

Source:

2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2016

Keywords:

VIDEO;

DOI:

10.1109/CVPR.2016.215

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Previous approaches to action recognition with deep features tend to process video frames only within a small temporal region, and do not model long-range dynamic information explicitly. However, such information is important for the accurate recognition of actions, especially for the discrimination of complex activities that share sub-actions, and when dealing with untrimmed videos. Here, we propose a representation, VLAD for Deep Dynamics (VLAD3), that accounts for different levels of video dynamics. It captures short-term dynamics with deep convolutional neural network features, relying on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the LDS and pooled over the whole video, to arrive at the final VLAD3 representation. An extensive evaluation was performed on Olympic Sports, UCF101 and THUMOS15, where the use of the VLAD3 representation leads to state-of-the-art results.
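
For readers unfamiliar with the VLAD encoding that the abstract builds on, the following is a minimal sketch of standard VLAD over plain feature vectors. It is not the paper's method: the abstract states that the VLAD descriptor is derived for LDS models, whereas this sketch assumes fixed-length vectors, a precomputed codebook, and squared Euclidean distance for assignment. The function name `vlad_encode` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def vlad_encode(descriptors, centers):
    """Standard VLAD (assumption: plain vectors, not LDS models).

    Sums the residuals of each descriptor to its nearest codeword,
    concatenates the per-codeword sums, and L2-normalizes the result.
    """
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest codeword
        # (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])),
        )
        # Accumulate the residual between descriptor and codeword.
        for d in range(dim):
            residuals[k][d] += x[d] - centers[k][d]
    # Concatenate per-codeword residual sums into a single vector.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: three 2-D descriptors and a two-word codebook.
descriptors = [[0.0, 1.0], [1.0, 0.0], [4.0, 5.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]
encoding = vlad_encode(descriptors, centers)
print(len(encoding))  # number of codewords times descriptor dimension
```

The output length is the number of codewords times the descriptor dimension, so the encoding has a fixed size regardless of how many descriptors a video contributes. That property is what makes a VLAD-style descriptor convenient to pool over a whole video.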


Pages: 1951-1960

Page count: 10