Unsupervised Low-Light Video Enhancement With Spatial-Temporal Co-Attention Transformer

Cited by: 3
Authors
Lv, Xiaoqian [1 ]
Zhang, Shengping [1 ]
Wang, Chenyang [1 ]
Zhang, Weigang [1 ]
Yao, Hongxun [2 ]
Huang, Qingming [3 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Peoples R China
[3] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Low-light video enhancement; unsupervised learning; curve estimation; transformer; IMAGE QUALITY ASSESSMENT; REPRESENTATION; FRAMEWORK; ALGORITHM;
DOI
10.1109/TIP.2023.3301332
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) trained in a supervised manner. Because paired dynamic low-/normal-light videos are difficult to collect in real-world scenes, these methods are usually trained on synthetic, static, and uniform-motion videos, which undermines their generalization to real-world scenes. Additionally, they typically suffer from temporal inconsistency (e.g., flickering artifacts and motion blur) when handling large-scale motions, since the local perception property of CNNs limits their ability to model long-range dependencies in both the spatial and temporal domains. To address these problems, we propose, to the best of our knowledge, the first unsupervised method for low-light video enhancement, named LightenFormer, which models long-range intra- and inter-frame dependencies with a spatial-temporal co-attention transformer to enhance brightness while maintaining temporal consistency. Specifically, an effective yet lightweight S-curve Estimation Network (SCENet) is first proposed to estimate pixel-wise S-shaped non-linear curves (S-curves) that adaptively adjust the dynamic range of an input video. Next, to model the temporal consistency of the video, we present a Spatial-Temporal Refinement Network (STRNet) to refine the enhanced video. The core module of STRNet is a novel Spatial-Temporal Co-attention Transformer (STCAT), which exploits multi-scale self- and cross-attention interactions to capture long-range correlations in both the spatial and temporal domains among frames for implicit motion estimation. To achieve unsupervised training, we further propose two non-reference loss functions based on the invertibility of the S-curve and the noise independence among frames. Extensive experiments on the SDSD and LLIV-Phone datasets demonstrate that our LightenFormer outperforms state-of-the-art methods.
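To illustrate the curve-adjustment and invertibility ideas the abstract describes, the following is a minimal PyTorch sketch, not the authors' implementation: it uses a Schlick-style bias/gain curve as a stand-in for the paper's S-curve, and apply_s_curve, invert_s_curve, and the per-pixel alpha map (standing in for SCENet's predicted curve parameters) are hypothetical names; the paper's exact parameterization may differ. The closed-form inverse shows the monotonicity/invertibility property that an invertibility-based non-reference loss could build on.

    import torch

    def apply_s_curve(x, alpha, eps=1e-6):
        # Schlick-style bias/gain tone curve on an image in [0, 1].
        # alpha is a per-pixel exponent map (> 0): alpha < 1 lifts dark
        # pixels (brightening), alpha > 1 suppresses them.
        x = x.clamp(eps, 1.0 - eps)
        num = x ** alpha
        return num / (num + (1.0 - x) ** alpha)

    def invert_s_curve(y, alpha, eps=1e-6):
        # Closed-form inverse of apply_s_curve: the curve is strictly
        # monotonic for alpha > 0, so it is invertible, which is the
        # property an invertibility-based non-reference loss relies on.
        y = y.clamp(eps, 1.0 - eps)
        r = (y / (1.0 - y)) ** (1.0 / alpha)
        return r / (1.0 + r)

    frame = 0.3 * torch.rand(1, 3, 64, 64)      # synthetic dark frame
    alpha = torch.full_like(frame, 0.5)         # stand-in for SCENet's output
    enhanced = apply_s_curve(frame, alpha)      # brightened frame
    recon = invert_s_curve(enhanced, alpha)     # map back to the input range
    print((recon - frame).abs().max().item())   # approx. 0: round-trip holds

Predicting a per-pixel alpha map rather than one global exponent is what lets the curve adapt the dynamic range locally, as the abstract describes for SCENet.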
Pages: 4701-4715
Number of pages: 15
Related papers
50 records in total
  • [1] DSFormer: Leveraging Transformer with Cross-Modal Attention for Temporal Consistency in Low-Light Video Enhancement
    Xu, JiaHao
    Mei, ShuHao
    Chen, ZiZheng
    Zhang, DanNi
    Shi, Fan
    Zhao, Meng
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024, 2024, 14872 : 27 - 38
  • [2] Temporal-Spatial Filtering for Enhancement of Low-Light Surveillance Video
    Guo, Fan
    Tang, Jin
    Peng, Hui
    Zou, Beiji
    JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2016, 20 (04) : 652 - 661
  • [3] Temporally Consistent Enhancement of Low-Light Videos via Spatial-Temporal Compatible Learning
    Zhu, Lingyu
    Yang, Wenhan
    Chen, Baoliang
    Zhu, Hanwei
    Meng, Xiandong
    Wang, Shiqi
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (10) : 4703 - 4723
  • [4] Adaptive Locally-Aligned Transformer for low-light video enhancement
    Cao, Yiwen
    Su, Yukun
    Deng, Jingliang
    Zhang, Yu
    Wu, Qingyao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 240
  • [5] Video Description with Spatial-Temporal Attention
    Tu, Yunbin
    Zhang, Xishan
    Liu, Bingtao
    Yan, Chenggang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1014 - 1022
  • [6] Scene Retrieval in Soccer Videos by Spatial-temporal Attention with Video Vision Transformer
    Gan, Yaozong
    Togo, Ren
    Ogawa, Takahiro
    Haseyama, Miki
    2022 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - TAIWAN, IEEE ICCE-TW 2022, 2022, : 453 - 454
  • [7] Collaborative spatial-temporal video salient object detection with cross attention transformer
    Su, Yuting
    Wang, Weikang
    Liu, Jing
    Jing, Peiguang
    SIGNAL PROCESSING, 2024, 224
  • [8] Spatial-Temporal Sequence Attention Based Efficient Transformer for Video Snow Removal
    Gao, Tao
    Zhang, Qianxi
    Chen, Ting
    Wen, Yuanbo
    BIG DATA MINING AND ANALYTICS, 2025, 8 (03): : 551 - 562
  • [9] MAGAN: Unsupervised Low-Light Image Enhancement Guided by Mixed-Attention
    Wang, Renjun
    Jiang, Bin
    Yang, Chao
    Li, Qiao
    Zhang, Bolin
    BIG DATA MINING AND ANALYTICS, 2022, 5 (02) : 110 - 119