Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Times Cited: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT II | 2024 / Vol. 15017
Funding
National Natural Science Foundation of China
Keywords
Frame camera; Event camera; Multi-modal fusion; Transformer self-attention; Monocular depth estimation; Vision
DOI
10.1007/978-3-031-72335-3_29
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Frame cameras struggle to estimate depth maps accurately under challenging lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, compensating for the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which limits the ability of event cameras to predict dense depth maps. Integrating event streams with frame data can significantly enhance monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that fuses event and frame data with a transformer-based model. The proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach exploits the complementary strengths of event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module that extracts contextual information and produces detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
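The abstract describes the encoder only at a high level, so below is a minimal PyTorch sketch of the general technique it names: joint self-attention over frame-patch tokens and event-tensor tokens, so that a single attention pass can model frame-frame, event-event, and cross-modal dependencies. Everything here is an illustrative assumption (the `CrossModalFusionBlock` name, the embedding dimension, the token counts, and the concatenate-attend-split scheme), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Illustrative joint self-attention over two modality token sets.

    Hypothetical sketch: the dimensions and the concatenate-attend-split
    scheme are assumptions, not the paper's exact architecture.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, frame_tokens: torch.Tensor, event_tokens: torch.Tensor):
        # Concatenate along the token axis so one attention pass sees both
        # intra-modality and cross-modality interactions.
        x = torch.cat([frame_tokens, event_tokens], dim=1)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm residual
        x = x + self.mlp(x)
        # Split the fused sequence back into per-modality token sets.
        n_frame = frame_tokens.shape[1]
        return x[:, :n_frame], x[:, n_frame:]

# Usage with made-up shapes: frames patchified into 196 tokens, and the
# event stream voxelized and patchified into 196 tokens of the same width.
frame_tokens = torch.randn(2, 196, 256)  # (batch, tokens, embed dim)
event_tokens = torch.randn(2, 196, 256)
fused_frame, fused_event = CrossModalFusionBlock()(frame_tokens, event_tokens)
print(fused_frame.shape, fused_event.shape)  # torch.Size([2, 196, 256]) twice
```

Applying such a block at several encoder stages with different patch sizes would give a multi-scale fusion along the lines the abstract sketches; the paper's dual-phase, triple-scale decoder is not reproduced here.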
Pages: 419-433
Number of pages: 15