VRT: A Video Restoration Transformer

Cited by: 31
Authors
Liang, Jingyun [1 ]
Cao, Jiezhang [1 ]
Fan, Yuchen [2 ]
Zhang, Kai [1 ,3 ]
Ranjan, Rakesh [2 ]
Li, Yawei [1 ]
Timofte, Radu [1 ]
Van Gool, Luc [1 ,4 ]
Affiliations
[1] Swiss Fed Inst Technol, D-ITET, Comp Vis Lab, CH-8092 Zurich, Switzerland
[2] Meta Inc, Menlo Pk, CA 94025 USA
[3] Nanjing Univ, Sch Intelligence Sci & Technol, Suzhou Campus, Suzhou 215163, Peoples R China
[4] Katholieke Univ Leuven, Dept Elect Engn, Proc Speech & Images (PSI), B-3001 Leuven, Belgium
Keywords
Video restoration; video super-resolution; video deblurring; video denoising; video frame interpolation; space-time video super-resolution; ENHANCEMENT; IMAGE;
DOI
10.1109/TIP.2024.3372454
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video restoration aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding-window strategy or a recurrent architecture, both of which are restricted to frame-by-frame restoration. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction ability. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal reciprocal self attention (TRSA) and parallel warping. TRSA divides the video into small clips, on which reciprocal attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping further fuses information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets. The codes are available at https://github.com/JingyunLiang/VRT.
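The abstract describes TRSA at a high level: features are partitioned into small temporal clips, attention is applied within each clip, and the sequence is shifted every other layer so that neighboring clips can interact. The snippet below is a minimal sketch of just that clip partitioning and temporal shifting; the clip size, shift amount, and the attention callable are illustrative assumptions, not the authors' implementation (see the linked repository for the official code).

# Minimal sketch of clip partitioning with temporal shifting, as outlined in the
# abstract. Assumptions: clip_size=2, shift of clip_size//2, and `attn` as a generic
# per-clip attention callable. Official code: https://github.com/JingyunLiang/VRT
import torch

def partition_clips(x, clip_size=2):
    """Split (T, H, W, C) features into (T // clip_size, clip_size, H, W, C) clips."""
    t, h, w, c = x.shape
    assert t % clip_size == 0, "sequence length must be divisible by the clip size"
    return x.view(t // clip_size, clip_size, h, w, c)

def shifted_clip_attention(x, attn, layer_idx, clip_size=2):
    """Apply an attention callable within small temporal clips.

    For every other layer the sequence is rolled along the time axis so that the
    clip boundaries move, letting adjacent clips exchange information
    (a temporal analogue of shifted windows).
    """
    shift = clip_size // 2 if layer_idx % 2 == 1 else 0
    if shift:
        x = torch.roll(x, shifts=-shift, dims=0)          # shift frames in time
    clips = partition_clips(x, clip_size)                 # independent small clips
    clips = torch.stack([attn(clip) for clip in clips])   # attention inside each clip
    x = clips.reshape(-1, *x.shape[1:])                   # merge back to (T, H, W, C)
    if shift:
        x = torch.roll(x, shifts=shift, dims=0)           # undo the temporal shift
    return x

# Usage sketch: an identity "attention" just to show the expected shapes.
feats = torch.randn(6, 64, 64, 96)                        # (T, H, W, C) features
out = shifted_clip_attention(feats, attn=lambda clip: clip, layer_idx=1)
assert out.shape == feats.shape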
Pages: 2171 - 2182
Number of pages: 12
Cited References
77 references in total
  • [1] Video Denoising via Empirical Bayesian Estimation of Space-Time Patches
    Arias, Pablo
    Morel, Jean-Michel
    [J]. JOURNAL OF MATHEMATICAL IMAGING AND VISION, 2018, 60 (01) : 70 - 93
  • [2] Depth-Aware Video Frame Interpolation
    Bao, Wenbo
    Lai, Wei-Sheng
    Ma, Chao
    Zhang, Xiaoyun
    Gao, Zhiyong
    Yang, Ming-Hsuan
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 3698 - 3707
  • [3] Nonlocal image and movie denoising
    Buades, Antoni
    Coll, Bartomeu
    Morel, Jean-Michel
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2008, 76 (02) : 123 - 139
  • [4] Chan KCK, 2021, Arxiv, DOI arXiv:2104.13371
  • [5] Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation
    Caballero, Jose
    Ledig, Christian
    Aitken, Andrew
    Acosta, Alejandro
    Totz, Johannes
    Wang, Zehan
    Shi, Wenzhe
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2848 - 2857
  • [6] Cao JZ, 2023, Arxiv, DOI arXiv:2106.06847
  • [7] BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond
    Chan, Kelvin C. K.
    Wang, Xintao
    Yu, Ke
    Dong, Chao
    Loy, Chen Change
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4945 - 4954
  • [8] Pre-Trained Image Processing Transformer
    Chen, Hanting
    Wang, Yunhe
    Guo, Tianyu
    Xu, Chang
    Deng, Yiping
    Liu, Zhenhua
    Ma, Siwei
    Xu, Chunjing
    Xu, Chao
    Gao, Wen
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12294 - 12305
  • [9] Choi M, 2020, AAAI CONF ARTIF INTE, V34, P10663
  • [10] Deformable Convolutional Networks
    Dai, Jifeng
    Qi, Haozhi
    Xiong, Yuwen
    Li, Yi
    Zhang, Guodong
    Hu, Han
    Wei, Yichen
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 764 - 773