VDTR: Video Deblurring With Transformer

Cited by: 29
Authors
Cao, Mingdeng [1 ]
Fan, Yanbo [2 ]
Zhang, Yong [2 ]
Wang, Jue [2 ]
Yang, Yujiu [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[2] Tencent AI Lab, Shenzhen 518054, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Image restoration; Computational modeling; Adaptation models; Image reconstruction; Convolution; Video deblurring; vision transformer; spatio-temporal modeling; NETWORK;
DOI
10.1109/TCSVT.2022.3201045
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Video deblurring remains an unsolved problem due to the challenging spatio-temporal modeling it requires, and existing convolutional neural network (CNN)-based methods show limited capacity for effective spatial and temporal modeling. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt a pure Transformer to video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of the Transformer for both spatial and temporal modeling. However, designing an appropriate Transformer-based model for video deblurring is challenging because of the complicated non-uniform blurs, the misalignment across multiple frames, and the high computational cost of high-resolution spatial modeling. To address these problems, VDTR performs attention within non-overlapping windows and exploits a hierarchical structure to model long-range dependencies. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt the Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS, and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks.
Pages: 160-171
Page count: 12
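
To make the abstract's core idea concrete, here is a minimal PyTorch sketch of attention computed within non-overlapping spatial windows. This is not the authors' released implementation; the class name `WindowAttention`, the `window_size` parameter, and the use of `nn.MultiheadAttention` are illustrative assumptions only. The sketch partitions a frame's feature map into windows, runs self-attention inside each window, and merges the windows back.

```python
# Hedged sketch of non-overlapping window attention, the mechanism the
# abstract describes for tractable high-resolution spatial modeling.
# NOT the authors' code; all names and design details are assumptions.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows."""

    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) frame features; H, W assumed divisible by window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the feature map into (B * num_windows, ws*ws, C) sequences.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Self-attention inside each window only: cost grows linearly with
        # the number of windows instead of quadratically in H * W.
        x, _ = self.attn(x, x, x)
        # Reverse the partition back to a (B, H, W, C) feature map.
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x


if __name__ == "__main__":
    feat = torch.randn(2, 32, 32, 64)                  # toy frame features
    wa = WindowAttention(dim=64, window_size=8, num_heads=4)
    print(wa(feat).shape)                              # torch.Size([2, 32, 32, 64])
```

Restricting attention to ws x ws windows reduces the attention cost from O((HW)^2) to O(HW * ws^2) per frame, which is the efficiency argument the abstract makes for applying a pure Transformer at high resolution.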