Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer

Cited: 0
Authors
Xiao, Baihui [1 ]
Xu, Jingzehua [1 ]
Zhang, Zekai [1 ]
Xing, Tianyu [1 ]
Wang, Jingjing [2 ]
Ren, Yong [3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT II | 2024, Vol. 15017
Funding
National Natural Science Foundation of China;
Keywords
Frame Camera; Event Camera; Multi-modal Fusion; Transformer self-attention; Monocular depth estimation; VISION;
DOI
10.1007/978-3-031-72335-3_29
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Frame cameras struggle to estimate depth maps accurately under adverse lighting conditions. In contrast, event cameras, with their high temporal resolution and high dynamic range, capture sparse, asynchronous event streams that record per-pixel brightness changes, addressing the limitations of frame cameras. However, the potential of asynchronous events remains underexploited, which hinders event cameras from predicting dense depth maps effectively. Integrating event streams with frame data can significantly enhance monocular depth estimation accuracy, especially in complex scenarios. In this study, we introduce a novel depth estimation framework that fuses event and frame data with a transformer-based model. The proposed framework contains two primary components: a multimodal encoder and a joint decoder. The multimodal encoder employs self-attention mechanisms to model the interactions between frame patches and event tensors, capturing dependencies across local and global spatiotemporal events. This multi-scale fusion approach maximizes the benefits of both event and frame inputs. The joint decoder incorporates a dual-phase, triple-scale feature fusion module that extracts contextual information and produces detailed depth predictions. Experimental results on the EventScape and MVSEC datasets confirm that our method sets a new performance benchmark.
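To make the encoder idea concrete, the following is a minimal PyTorch sketch of the kind of event-frame token fusion the abstract describes: frame patches and an event voxel grid are embedded into a shared token space, and transformer self-attention runs over the concatenated sequence so dependencies between the two modalities are modeled jointly. The module name, the 5-bin event voxelization, the 16-pixel patch size, and all dimensions are illustrative assumptions, not the authors' implementation (positional embeddings and multi-scale fusion are omitted for brevity).

```python
# Hedged sketch of event-frame fusion via transformer self-attention.
# Names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class EventFrameFusionEncoder(nn.Module):
    def __init__(self, img_channels=3, event_bins=5, patch=16, dim=256,
                 depth=4, heads=8):
        super().__init__()
        # Patch embeddings: one strided-conv projection per modality
        # into a shared token dimension.
        self.frame_embed = nn.Conv2d(img_channels, dim, patch, stride=patch)
        self.event_embed = nn.Conv2d(event_bins, dim, patch, stride=patch)
        # Learned modality embeddings so attention can distinguish
        # frame tokens from event tokens.
        self.frame_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.event_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame, event_voxel):
        # frame:       (B, 3, H, W) intensity image
        # event_voxel: (B, T, H, W) events accumulated into T temporal bins
        f = self.frame_embed(frame).flatten(2).transpose(1, 2)        # (B, N, dim)
        e = self.event_embed(event_voxel).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = torch.cat([f + self.frame_type, e + self.event_type], dim=1)
        # Self-attention over the concatenated sequence lets every frame
        # patch attend to every event patch and vice versa (cross-modal fusion).
        return self.encoder(tokens)

# Usage: the fused tokens would feed a decoder that upsamples back to a
# dense depth map; the paper's joint decoder fuses features at three scales.
enc = EventFrameFusionEncoder()
fused = enc(torch.randn(1, 3, 256, 256), torch.randn(1, 5, 256, 256))
print(fused.shape)  # torch.Size([1, 512, 256]): 2 * (256/16)**2 tokens
```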
Pages: 419-433
Number of pages: 15