Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation

Cited by: 0
Authors
Ning, Xin [1 ]
Cai, Feifan [1 ]
Li, Yuhang [1 ]
Ding, Youdong [1 ,2 ]
Affiliations
[1] Shanghai Univ, Coll Shanghai Film, 788 Guangzhong Rd, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 788 Guangzhong Rd, Shanghai 200072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
video frame interpolation; spatio-temporal attention mechanism; Transformer; multi-scale information;
DOI
10.3390/electronics13101981
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Traditional video frame interpolation methods based on deep convolutional neural networks struggle with large motions. Their performance is limited because convolutional operations cannot directly integrate the rich temporal and spatial information of inter-frame pixels, and these methods rely heavily on additional inputs, such as optical flow, to model motion. To address this issue, we develop a novel video frame interpolation framework that uses a Transformer to efficiently model the long-range similarity of inter-frame pixels. Furthermore, to aggregate spatio-temporal features effectively, we design a novel attention mechanism divided into temporal attention and spatial attention. Specifically, spatial attention aggregates intra-frame information, integrating the attention and convolution paradigms through a simple mapping approach, while temporal attention models the similarity of pixels along the timeline. This design processes the two types of information in parallel without extra computational cost, aggregating information across the space-time dimensions. In addition, we introduce a context extraction network and a multi-scale prediction-frame synthesis network to further improve the Transformer's performance. We conduct extensive quantitative and qualitative experiments comparing our method with state-of-the-art methods on various benchmark datasets. On the Vimeo90K and UCF101 datasets, our model improves PSNR over UPR-Net-large by 0.09 dB and 0.01 dB, respectively. On the Vimeo90K dataset, our model outperforms FLAVR by 0.07 dB with only 40.56% of its parameters. The qualitative results show that, for complex and large-motion scenes, our method generates sharper and more realistic edges and details.
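The record gives no implementation details, so the following is only a minimal, hypothetical sketch (in PyTorch) of the parallel spatio-temporal attention idea described in the abstract: one branch attends over the H×W positions within each frame and is mixed with a simple convolutional mapping, a second branch attends over the T time steps at each spatial location, and the two branch outputs are fused. All module names, channel sizes, head counts, and the fusion step are assumptions for illustration, not the authors' architecture.

```python
# Toy sketch of parallel spatial + temporal attention over video features.
# Shapes, modules, and fusion are illustrative assumptions only.
import torch
import torch.nn as nn


class ParallelSpatioTemporalAttention(nn.Module):
    """Parallel spatial/temporal attention over features of shape (B, T, C, H, W)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Simple convolutional mapping mixed into the spatial branch, loosely
        # mirroring the "attention + convolution" integration described above.
        self.conv_map = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape

        # Spatial branch: attend over the H*W tokens within each frame.
        xs = x.reshape(b * t, c, h, w)
        conv_feat = self.conv_map(xs)                          # (B*T, C, H, W)
        tokens = xs.flatten(2).transpose(1, 2)                 # (B*T, H*W, C)
        s_out, _ = self.spatial_attn(tokens, tokens, tokens)   # (B*T, H*W, C)
        s_out = s_out.transpose(1, 2).reshape(b * t, c, h, w) + conv_feat
        s_out = s_out.reshape(b, t, c, h, w)

        # Temporal branch: attend over the T tokens at each spatial location.
        xt = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # (B*H*W, T, C)
        t_out, _ = self.temporal_attn(xt, xt, xt)               # (B*H*W, T, C)
        t_out = t_out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

        # Fuse the two branches channel-wise.
        fused = torch.cat([s_out, t_out], dim=2)               # (B, T, 2C, H, W)
        fused = fused.permute(0, 1, 3, 4, 2)                   # (B, T, H, W, 2C)
        return self.fuse(fused).permute(0, 1, 4, 2, 3)         # (B, T, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(1, 2, 32, 16, 16)       # features of two input frames
    block = ParallelSpatioTemporalAttention(channels=32)
    print(block(feats).shape)                   # torch.Size([1, 2, 32, 16, 16])
```

In a full interpolation pipeline, a block of this kind would presumably sit between a feature/context extractor and the multi-scale synthesis network mentioned in the abstract; how the paper actually wires these components together is not specified in this record.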
Pages: 17