Convolutional Two-Stream Network Fusion for Video Action Recognition

Cited by: 1898
Authors
Feichtenhofer, Christoph [1]
Pinz, Axel [1]
Zisserman, Andrew [2]
Affiliations
[1] Graz Univ Technol, Graz, Austria
[2] Univ Oxford, Oxford, England
Source
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) | 2016
Funding
Engineering and Physical Sciences Research Council (UK); Austrian Science Fund
Keywords
DOI
10.1109/CVPR.2016.213
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
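To make findings (i) and (iii) concrete, here is a minimal NumPy sketch of fusing two towers at a convolutional layer and then pooling over a spatiotemporal neighbourhood. The abstract does not say which fusion function the authors use, so this sketch assumes one plausible choice: stack the two feature maps channel-wise and mix them with a 1x1 convolution. The function names, the array shapes, and the pooling window are hypothetical, chosen only for illustration.

```python
import numpy as np


def conv_fusion(spatial, temporal, weights, bias):
    """Fuse two (C, H, W) feature maps into one (C_out, H, W) map.

    Assumption: fusion = channel-wise stacking followed by a 1x1
    convolution. weights has shape (C_out, 2C), bias has shape (C_out,).
    """
    if spatial.shape != temporal.shape:
        raise ValueError("both towers must produce maps of the same shape")
    # Stack the appearance and motion channels at each pixel: (2C, H, W).
    stacked = np.concatenate([spatial, temporal], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum("oc,chw->ohw", weights, stacked) + bias[:, None, None]


def spatiotemporal_max_pool(fused_seq, k=2):
    """Max-pool a (T, C, H, W) sequence over all T frames and over
    non-overlapping k x k spatial windows, giving (C, H // k, W // k).

    Assumption: pooling spans the whole temporal extent of the snippet.
    """
    t, c, h, w = fused_seq.shape
    if h % k or w % k:
        raise ValueError("H and W must be divisible by k")
    windows = fused_seq.reshape(t, c, h // k, k, w // k, k)
    return windows.max(axis=(0, 3, 5))
```

In this sketch the fused map feeds a single stream of later layers, so the layers after the fusion point are not duplicated per tower; that is one way to read the parameter saving mentioned in finding (i). Setting `weights` to two half-identity blocks side by side reduces the fusion to a plain average of the two towers.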
Pages: 1933-1941
Page count: 9