TempFormer: Temporally Consistent Transformer for Video Denoising

Cited by: 10
Authors
Song, Mingyang [1, 2]
Zhang, Yang [2 ]
Aydin, Tunc O. [2 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] DisneyRes Studios, Zurich, Switzerland
Source
COMPUTER VISION, ECCV 2022, PT XIX | 2022 / Vol. 13679
Keywords
Video denoising; Transformer; Temporal consistency;
DOI
10.1007/978-3-031-19800-7_28
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video denoising is a low-level vision task that aims to restore high-quality videos from noisy content. Vision Transformer (ViT) is a new machine learning architecture that has shown promising performance on both high-level and low-level image tasks. In this paper, we propose a modified ViT architecture for video processing tasks, introducing a new training strategy and loss function to enhance temporal consistency without compromising spatial quality. Specifically, we propose an efficient hybrid Transformer-based model, TempFormer, which combines Spatio-Temporal Transformer Blocks (STTB) and 3D convolutional layers. The proposed STTB learns the temporal information between neighboring frames implicitly by utilizing the proposed Joint Spatio-Temporal Mixer module for attention calculation and feature aggregation in each ViT block. Moreover, existing methods suffer from temporal inconsistency artifacts that are problematic in practical cases and distracting to viewers. We propose a sliding block strategy with recurrent architecture, and use a new loss term, Overlap Loss, to alleviate the flickering between adjacent frames. Our method produces state-of-the-art spatio-temporal denoising quality with significantly improved temporal coherency, and requires fewer computational resources to achieve denoising quality comparable to that of competing methods (Fig. 1).
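The abstract names a sliding block strategy and an Overlap Loss but does not define either. The sketch below is one possible reading, not the authors' implementation. It assumes the video is cut into fixed-length temporal blocks that share frames with their neighbors, and that the loss penalizes disagreement between two blocks' outputs on those shared frames. The function names, the block length, the stride, and the choice of an L1 penalty are all hypothetical, and the recurrent part of the architecture is not modeled.

```python
# Illustrative sketch only: block length, stride, and the L1 penalty are
# assumptions, not definitions or values taken from the paper.

def sliding_blocks(num_frames, block_len, stride):
    """Return the frame indices covered by each temporal block, in order.

    Frames left over after the last full block are not covered.
    """
    if not 0 < stride <= block_len <= num_frames:
        raise ValueError("need 0 < stride <= block_len <= num_frames")
    starts = range(0, num_frames - block_len + 1, stride)
    return [list(range(s, s + block_len)) for s in starts]


def overlap_penalty(prev_out, next_out, prev_idx, next_idx):
    """Mean absolute difference between two blocks' outputs on shared frames.

    prev_out and next_out hold one flat list of pixel values per frame,
    ordered like prev_idx and next_idx respectively.
    """
    shared = [f for f in prev_idx if f in next_idx]
    if not shared:
        return 0.0
    total, count = 0.0, 0
    for f in shared:
        a = prev_out[prev_idx.index(f)]
        b = next_out[next_idx.index(f)]
        total += sum(abs(x - y) for x, y in zip(a, b))
        count += len(a)
    return total / count


# Six frames, blocks of four frames, stride two: frames 2 and 3 are shared.
blocks = sliding_blocks(num_frames=6, block_len=4, stride=2)
prev_out = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
next_out = [[1.0, 2.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
penalty = overlap_penalty(prev_out, next_out, blocks[0], blocks[1])
```

Under this reading, a penalty of zero means the two blocks produce identical output on the frames they share, which is the kind of agreement that would reduce flicker at block boundaries.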
Pages: 481-496
Page count: 16