Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Cited by: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT II | 2024 / Vol. 15017
Funding
National Natural Science Foundation of China;
Keywords
Frame camera; Event camera; Multi-modal fusion; Transformer self-attention; Monocular depth estimation; Vision
DOI
10.1007/978-3-031-72335-3_29
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Frame cameras struggle to estimate depth maps accurately under challenging lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, compensating for the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders event cameras from predicting dense depth maps effectively. Integrating event streams with frame data can significantly improve monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that combines event and frame data using a transformer-based model. The proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module, which extracts contextual information and delivers detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
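The token-level fusion described in the abstract can be illustrated with a minimal sketch: frame patches and event voxel-grid tensors are projected into a shared embedding space and processed jointly by transformer self-attention, so every frame token can attend to every event token and vice versa. The sketch below assumes PyTorch; the class name EventFrameFusionEncoder, the 5-bin event voxelization, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of transformer self-attention fusion
# of frame patches and event voxel-grid tensors. Names and sizes are illustrative.
import torch
import torch.nn as nn


class EventFrameFusionEncoder(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=4,
                 patch_size=16, event_bins=5):
        super().__init__()
        # Patch embeddings: frames have 3 channels; events are assumed to be
        # accumulated into a voxel grid with `event_bins` temporal channels.
        self.frame_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.event_embed = nn.Conv2d(event_bins, embed_dim, patch_size, stride=patch_size)
        # Learned embeddings that tag tokens with their source modality.
        self.modality = nn.Parameter(torch.zeros(2, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        # Self-attention over the concatenated token sequence lets every frame
        # patch interact with every event token (and vice versa).
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame, event_voxels):
        # frame: (B, 3, H, W); event_voxels: (B, event_bins, H, W)
        f = self.frame_embed(frame).flatten(2).transpose(1, 2)          # (B, Nf, C)
        e = self.event_embed(event_voxels).flatten(2).transpose(1, 2)   # (B, Ne, C)
        tokens = torch.cat([f + self.modality[0], e + self.modality[1]], dim=1)
        return self.encoder(tokens)  # fused tokens for a downstream depth decoder


if __name__ == "__main__":
    # Example usage with a 224x224 input: 2 * (224/16)^2 = 392 fused tokens.
    model = EventFrameFusionEncoder()
    frame = torch.randn(1, 3, 224, 224)
    events = torch.randn(1, 5, 224, 224)
    print(model(frame, events).shape)  # torch.Size([1, 392, 256])
```

In the paper, the fused tokens would feed the joint decoder's dual-phase, triple-scale fusion module; that stage is omitted here since the abstract does not specify its structure.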
Pages: 419 - 433
Number of pages: 15