Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation

Cited by: 0
Authors
Ning, Xin [1 ]
Cai, Feifan [1 ]
Li, Yuhang [1 ]
Ding, Youdong [1 ,2 ]
Affiliations
[1] Shanghai Univ, Coll Shanghai Film, 788 Guangzhong Rd, Shanghai 200072, Peoples R China
[2] Shanghai Engn Res Ctr Mot Picture Special Effects, 788 Guangzhong Rd, Shanghai 200072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
video frame interpolation; spatio-temporal attention mechanism; Transformer; multi-scale information;
DOI
10.3390/electronics13101981
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Traditional video frame interpolation methods based on deep convolutional neural networks struggle with large motions. Their performance is limited because convolutional operations cannot directly integrate the rich temporal and spatial information of inter-frame pixels, and these methods rely heavily on additional inputs, such as optical flow, to model motion. To address this issue, we develop a novel video frame interpolation framework that uses a Transformer to efficiently model the long-range similarity of inter-frame pixels. Furthermore, to aggregate spatio-temporal features effectively, we design a novel attention mechanism divided into temporal attention and spatial attention. Specifically, spatial attention aggregates intra-frame information, integrating the attention and convolution paradigms through a simple mapping approach, while temporal attention models the similarity of pixels along the timeline. This design processes the two types of information in parallel without extra computational cost, aggregating information across the space-time dimensions. In addition, we introduce a context extraction network and a multi-scale prediction-frame synthesis network to further improve the Transformer's performance. We conduct extensive quantitative and qualitative experiments comparing our method with state-of-the-art methods on various benchmark datasets. On the Vimeo90K and UCF101 datasets, our model improves PSNR over UPR-Net-large by 0.09 dB and 0.01 dB, respectively. On the Vimeo90K dataset, our model outperforms FLAVR by 0.07 dB with only 40.56% of its parameters. The qualitative results show that, for complex and large-motion scenes, our method generates sharper and more realistic edges and details.
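The record gives no implementation details, so the following is only a minimal, hypothetical sketch (in PyTorch) of the parallel spatio-temporal attention idea described in the abstract: one branch attends over the H×W positions within each frame and is mixed with a simple convolutional mapping, a second branch attends over the T time steps at each spatial location, and the two branch outputs are fused. All module names, channel sizes, head counts, and the fusion step are assumptions for illustration, not the authors' architecture.

```python
# Toy sketch of parallel spatial + temporal attention over video features.
# Shapes, modules, and fusion are illustrative assumptions only.
import torch
import torch.nn as nn


class ParallelSpatioTemporalAttention(nn.Module):
    """Parallel spatial/temporal attention over features of shape (B, T, C, H, W)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Simple convolutional mapping mixed into the spatial branch, loosely
        # mirroring the "attention + convolution" integration described above.
        self.conv_map = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape

        # Spatial branch: attend over the H*W tokens within each frame.
        xs = x.reshape(b * t, c, h, w)
        conv_feat = self.conv_map(xs)                          # (B*T, C, H, W)
        tokens = xs.flatten(2).transpose(1, 2)                 # (B*T, H*W, C)
        s_out, _ = self.spatial_attn(tokens, tokens, tokens)   # (B*T, H*W, C)
        s_out = s_out.transpose(1, 2).reshape(b * t, c, h, w) + conv_feat
        s_out = s_out.reshape(b, t, c, h, w)

        # Temporal branch: attend over the T tokens at each spatial location.
        xt = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # (B*H*W, T, C)
        t_out, _ = self.temporal_attn(xt, xt, xt)               # (B*H*W, T, C)
        t_out = t_out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

        # Fuse the two branches channel-wise.
        fused = torch.cat([s_out, t_out], dim=2)               # (B, T, 2C, H, W)
        fused = fused.permute(0, 1, 3, 4, 2)                   # (B, T, H, W, 2C)
        return self.fuse(fused).permute(0, 1, 4, 2, 3)         # (B, T, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(1, 2, 32, 16, 16)       # features of two input frames
    block = ParallelSpatioTemporalAttention(channels=32)
    print(block(feats).shape)                   # torch.Size([1, 2, 32, 16, 16])
```

In a full interpolation pipeline, a block of this kind would presumably sit between a feature/context extractor and the multi-scale synthesis network mentioned in the abstract; how the paper actually wires these components together is not specified in this record.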
Pages: 17