Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging

Times cited: 3
Authors
Cao, Miao [1, 2, 3]
Wang, Lishun [2, 3]
Zhu, Mingyu [2, 3]
Yuan, Xin [2, 3]
Affiliations
[1] Zhejiang Univ, Hangzhou 310058, Zhejiang, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310030, Zhejiang, Peoples R China
[3] Westlake Univ, Res Ctr Ind Future, Hangzhou 310030, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Computational imaging; Snapshot compressive imaging; Compressive sensing; Deep learning; Convolutional neural networks; Transformer
DOI
10.1007/s11263-024-02101-y
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video snapshot compressive imaging (SCI) uses a low-speed 2D detector to capture a high-speed scene: the dynamic scene is modulated by different masks and then compressed into a single snapshot measurement. A reconstruction algorithm is then needed to recover the high-speed video frames. Although state-of-the-art (SOTA) deep learning-based reconstruction algorithms have achieved impressive results, they still face the following challenges due to excessive model complexity and GPU memory limitations: (1) these models incur a high computational cost, and (2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI, dubbed EfficientSCI++, that uses hierarchical residual-like connections and a hybrid CNN-Transformer structure within a single residual block. The EfficientSCI++ network exploits spatial-temporal correlation by using convolution in the spatial domain and a Transformer in the temporal domain. We demonstrate for the first time that a UHD color video (1644 × 3840 × 3) with a high compression ratio (40) can be reconstructed from a snapshot 2D measurement using a single end-to-end deep learning model with PSNR above 34 dB. Moreover, a mixed-precision model is trained to further accelerate the video SCI reconstruction process and reduce the memory footprint. Extensive results on both simulation and real data demonstrate that, compared with previous SOTA methods, our proposed EfficientSCI++ and EfficientSCI achieve comparable reconstruction quality with much lower computational cost and better real-time performance.
Code is available at https://github.com/mcao92/EfficientSCI-plus-plus.
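The measurement process described in the abstract (each high-speed frame is modulated by its own mask, and the modulated frames are summed into one 2D snapshot) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `sci_forward`, the grayscale setting, the random binary masks, and the small array sizes are all assumptions chosen for illustration only.

```python
import numpy as np


def sci_forward(frames, masks):
    """Simulate one snapshot measurement (illustrative, hypothetical helper).

    frames: array of shape (B, H, W), the B high-speed frames.
    masks:  array of shape (B, H, W), one modulation mask per frame.
    Returns an array of shape (H, W): the sum over time of mask * frame.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must have the same shape")
    # Element-wise modulation, then compression along the time axis.
    return (masks * frames).sum(axis=0)


# Toy sizes (assumed); B plays the role of the compression ratio.
rng = np.random.default_rng(0)
B, H, W = 8, 16, 16
frames = rng.random((B, H, W))
masks = rng.integers(0, 2, size=(B, H, W))  # random binary masks (assumed)
snapshot = sci_forward(frames, masks)
```

Here `B` frames are collapsed into one `H × W` measurement, so a reconstruction algorithm has to recover `B` times more values than it observes; this is the inverse problem that the reconstruction network addresses.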
Pages: 4521-4540
Number of pages: 20