VDTR: Video Deblurring With Transformer

Cited by: 39
Authors
Cao, Mingdeng [1 ]
Fan, Yanbo [2 ]
Zhang, Yong [2 ]
Wang, Jue [2 ]
Yang, Yujiu [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[2] Tencent AI Lab, Shenzhen 518054, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Feature extraction; Image restoration; Computational modeling; Adaptation models; Image reconstruction; Convolution; Video deblurring; vision transformer; spatio-temporal modeling; NETWORK;
DOI
10.1109/TCSVT.2022.3201045
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communications Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Video deblurring remains an unsolved problem because of the challenging spatio-temporal modeling it requires, and existing convolutional neural network (CNN)-based methods show limited capacity for effective spatial and temporal modeling. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt a pure Transformer to video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of the Transformer for both spatial and temporal modeling. However, designing an appropriate Transformer-based model for video deblurring is challenging due to the complicated non-uniform blurs, the misalignment across multiple frames, and the high computational cost of high-resolution spatial modeling. To address these problems, VDTR performs attention within non-overlapping windows and exploits a hierarchical structure to model long-range dependencies. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt the Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks.
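To make the window-attention mechanism concrete, the following is a minimal PyTorch sketch of self-attention restricted to non-overlapping spatial windows, the device the abstract credits for keeping attention tractable at high resolution. It is a sketch of the general technique under assumed conventions: the class name, window_size, and num_heads are illustrative and not taken from the authors' released code.

import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    # Multi-head self-attention computed independently inside each
    # non-overlapping window_size x window_size window (a hypothetical
    # sketch of windowed attention, not VDTR's exact implementation).
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C), with H and W divisible by the window size.
        B, H, W, C = x.shape
        ws, nh = self.ws, self.num_heads
        # Partition the frame into non-overlapping ws x ws windows.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        n = x.shape[0]  # number of windows across the batch
        # Attend only among the ws*ws tokens of the same window, so the
        # cost grows linearly with image size instead of quadratically.
        qkv = self.qkv(x).reshape(n, ws * ws, 3, nh, C // nh)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(n, ws * ws, C)
        out = self.proj(out)
        # Undo the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

With window_size=8 on a 64x64 feature map, attention runs over 64 independent 64-token windows rather than one 4096-token sequence; the hierarchical multi-scale structure described in the abstract is then what propagates information across window boundaries.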
Pages: 160-171
Page count: 12