VRT: A Video Restoration Transformer

Cited by: 31
Authors
Liang, Jingyun [1]
Cao, Jiezhang [1]
Fan, Yuchen [2]
Zhang, Kai [1,3]
Ranjan, Rakesh [2]
Li, Yawei [1]
Timofte, Radu [1]
Van Gool, Luc [1,4]
Affiliations
[1] Swiss Fed Inst Technol, D-ITET, Comp Vis Lab, CH-8092 Zurich, Switzerland
[2] Meta Inc, Menlo Pk, CA 94025 USA
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou Campus, Suzhou 215163, Peoples R China
[4] Katholieke Univ Leuven, Dept Elect Engn, Proc Speech and Images PSI, B-3001 Leuven, Belgium
Keywords
Video restoration; video super-resolution; video deblurring; video denoising; video frame interpolation; space-time video super-resolution; ENHANCEMENT; IMAGE
DOI
10.1109/TIP.2024.3372454
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video restoration aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this with a sliding-window strategy or a recurrent architecture, both of which are restricted to frame-by-frame restoration. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction ability. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal reciprocal self-attention (TRSA) and parallel warping. TRSA divides the video into small clips, on which reciprocal attention is applied for joint motion estimation, feature alignment and feature fusion, while self-attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets. The code is available at https://github.com/JingyunLiang/VRT.
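The clip-based attention with alternating shifts described in the abstract can be illustrated with a minimal sketch. This is not the official VRT implementation; the clip size, the half-clip shift and the partition_clips helper below are illustrative assumptions only.

# Minimal sketch of temporal clip partitioning with alternating shifts
# (NOT the official VRT code; clip_size, the half-clip shift and the
# helper name are illustrative assumptions).
import torch

def partition_clips(x, clip_size=2, shift=False):
    # x: (T, C) per-frame features. Returns (T // clip_size, clip_size, C).
    # When shift is True, roll the sequence by half a clip so that clip
    # boundaries move and frames can attend across the previous partition.
    if shift:
        x = torch.roll(x, shifts=-(clip_size // 2), dims=0)
    t, c = x.shape
    assert t % clip_size == 0, "pad T to a multiple of clip_size first"
    return x.reshape(t // clip_size, clip_size, c)

# Toy usage: 6 frames with 4-dim features (frame index stored in the features).
frames = torch.arange(6).float().unsqueeze(1).repeat(1, 4)
even_layer = partition_clips(frames, shift=False)  # clips [0,1], [2,3], [4,5]
odd_layer = partition_clips(frames, shift=True)    # clips [1,2], [3,4], [5,0]
# Within each clip, reciprocal attention would align the frames and
# self-attention would extract features; alternating the shift couples clips.
print(even_layer[:, 0, 0], odd_layer[:, 0, 0])

Stacking layers that alternate between the two partitions lets information propagate across the whole sequence, analogous to shifted-window attention in the spatial domain.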
Pages: 2171-2182
Page count: 12