VDTR: Video Deblurring With Transformer

Cited by: 39
Authors
Cao, Mingdeng [1 ]
Fan, Yanbo [2 ]
Zhang, Yong [2 ]
Wang, Jue [2 ]
Yang, Yujiu [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
[2] Tencent AI Lab, Shenzhen 518054, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Feature extraction; Image restoration; Computational modeling; Adaptation models; Image reconstruction; Convolution; Video deblurring; vision transformer; spatio-temporal modeling; NETWORK;
DOI
10.1109/TCSVT.2022.3201045
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communications Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Video deblurring remains an unsolved problem because of the challenging spatio-temporal modeling it requires, and existing convolutional neural network (CNN)-based methods show limited capacity for effective spatial and temporal modeling. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt a pure Transformer to video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of the Transformer for both spatial and temporal modeling. However, designing an appropriate Transformer-based model for video deblurring is challenging due to the complicated non-uniform blurs, the misalignment across multiple frames, and the high computational cost of high-resolution spatial modeling. To address these problems, VDTR performs attention within non-overlapping windows and exploits a hierarchical structure to model long-range dependencies. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt the Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks.
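To make the window-attention mechanism concrete, the following is a minimal PyTorch sketch of self-attention restricted to non-overlapping spatial windows, the device the abstract credits for keeping attention tractable at high resolution. It is a sketch of the general technique under assumed conventions: the class name, window_size, and num_heads are illustrative and not taken from the authors' released code.

import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    # Multi-head self-attention computed independently inside each
    # non-overlapping window_size x window_size window (a hypothetical
    # sketch of windowed attention, not VDTR's exact implementation).
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C), with H and W divisible by the window size.
        B, H, W, C = x.shape
        ws, nh = self.ws, self.num_heads
        # Partition the frame into non-overlapping ws x ws windows.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        n = x.shape[0]  # number of windows across the batch
        # Attend only among the ws*ws tokens of the same window, so the
        # cost grows linearly with image size instead of quadratically.
        qkv = self.qkv(x).reshape(n, ws * ws, 3, nh, C // nh)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(n, ws * ws, C)
        out = self.proj(out)
        # Undo the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

With window_size=8 on a 64x64 feature map, attention runs over 64 independent 64-token windows rather than one 4096-token sequence; the hierarchical multi-scale structure described in the abstract is then what propagates information across window boundaries.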
Pages: 160-171
Page count: 12