End-to-End Learning of Motion Representation for Video Understanding

Cited by: 167
Authors
Fan, Lijie [1, 2]
Huang, Wenbing [1]
Gan, Chuang [3]
Ermon, Stefano [4]
Gong, Boqing [1]
Huang, Junzhou [1]
Affiliations
[1] Tencent AI Lab, Bellevue, WA 98004 USA
[2] Tsinghua Univ, Beijing, Peoples R China
[3] MIT, Watson Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[4] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Source
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018
DOI
10.1109/CVPR.2018.00630
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network, to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally concatenated with other task-specific networks to formulate an end-to-end architecture, thus making our method more efficient than current multi-stage approaches by avoiding the need to pre-compute and store features on disk. Finally, the parameters of TVNet can be further fine-tuned by end-to-end training. This enables TVNet to learn richer, task-specific patterns beyond exact optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach. TVNet achieves higher accuracy than all compared methods, while being competitive with the fastest counterpart in terms of feature extraction time.
Pages: 6016-6025
Page count: 10