Motion-Aware Feature Enhancement Network for Video Prediction

Cited by: 20
Authors
Lin, Xue [1]
Zou, Qi [1]
Xu, Xixia [1]
Huang, Yaping [1]
Tian, Yi [1]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing 100044, Peoples R China
Keywords
Predictive models; Encoding; Multiprotocol label switching; Stochastic processes; Dynamics; Feature extraction; Task analysis; Video prediction; unsupervised learning; attention mechanism; perceptual loss;
DOI
10.1109/TCSVT.2020.2987141
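Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline classification code
0808; 0809;
Abstract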
Video prediction is challenging because of the pixel-level precision requirement and the difficulty of capturing scene dynamics. Most approaches tackle these problems with pixel-level reconstruction objectives and two decomposed branches, yet they still suffer from blurry generations or dramatic degradation in long-term prediction. In this paper, we propose a Motion-Aware Feature Enhancement (MAFE) network for video prediction that produces realistic future frames and achieves relatively long-term predictions. First, a Channel-wise and Spatial Attention (CSA) module is designed to extract motion-aware features; it enhances the contribution of important motion details during encoding and subsequently improves the discriminability of the attention map used for frame refinement. Second, a Motion Perceptual Loss (MPL) is proposed to guide the learning of temporal cues, which benefits robust long-term video prediction. Extensive experiments on three human activity video datasets (KTH, Human3.6M, and PennAction) demonstrate the effectiveness of the proposed video prediction model compared with state-of-the-art approaches.
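The abstract does not specify how the CSA module is built, so the following is only a generic sketch of the idea of combining channel-wise and spatial attention, not the authors' design. It assumes a feature map stored as nested lists (C channels, each H x W) and uses parameter-free gates (average pooling followed by a sigmoid) in place of the learned layers a real network would have. All function names (`channel_attention`, `spatial_attention`, `apply_csa`) are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Return one gate per channel from its global average activation.

    feat: list of C channels, each an H x W list of floats.
    Assumption: no learned weights; a real module would add trainable layers.
    """
    weights = []
    for channel in feat:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        weights.append(sigmoid(total / count))
    return weights


def spatial_attention(feat):
    """Return an H x W gate map from the cross-channel mean at each position."""
    c = len(feat)
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def apply_csa(feat):
    """Scale features by channel gates first, then by spatial gates."""
    cw = channel_attention(feat)
    scaled = [
        [[value * cw[k] for value in row] for row in channel]
        for k, channel in enumerate(feat)
    ]
    sw = spatial_attention(scaled)
    return [
        [[channel[i][j] * sw[i][j] for j in range(len(channel[0]))]
         for i in range(len(channel))]
        for channel in scaled
    ]
```

Because both gates lie strictly between 0 and 1, channels and positions with stronger average activation are attenuated less than weak ones, which is the sense in which attention "enhances the contribution" of selected features. The channel-then-spatial ordering here is an arbitrary choice for illustration.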
Pages: 688-700
Number of pages: 13
Related papers
57 in total
  • [1] Nguyen A, 2015, PROC CVPR IEEE, P427, DOI 10.1109/CVPR.2015.7298640
  • [2] Long short-term memory
    Hochreiter, S
    Schmidhuber, J
    [J]. NEURAL COMPUTATION, 1997, 9 (08) : 1735 - 1780
  • [3] Babaeizadeh M., 2018, 6 INT C LEARN REPRES, P1
  • [4] ContextVP: Fully Context-Aware Video Prediction
    Byeon, Wonmin
    Wang, Qin
    Srivastava, Rupesh Kumar
    Koumoutsakos, Petros
    [J]. COMPUTER VISION - ECCV 2018, PT XVI, 2018, 11220 : 781 - 797
  • [5] SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning
    Chen, Long
    Zhang, Hanwang
    Xiao, Jun
    Nie, Liqiang
    Shao, Jian
    Liu, Wei
    Chua, Tat-Seng
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6298 - 6306
  • [6] MARS: Motion-Augmented RGB Stream for Action Recognition
    Crasto, Nieves
    Weinzaepfel, Philippe
    Alahari, Karteek
    Schmid, Cordelia
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 7874 - 7883
  • [7] Denton E., 2018, P INT C MACH LEARN, P1
  • [8] Ebert F., 2017, C ROB LEARN CORL
  • [9] Feichtenhofer C, 2016, ADV NEUR IN, V29
  • [10] Convolutional Two-Stream Network Fusion for Video Action Recognition
    Feichtenhofer, Christoph
    Pinz, Axel
    Zisserman, Andrew
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 1933 - 1941