Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Cited by: 13
Authors
Wang, Xiang [1 ]
Luo, Haonan [2 ]
Wang, Zihang [1 ]
Zheng, Jin [1 ]
Bai, Xiao [1 ]
Affiliations
[1] Beihang Univ, Jiangxi Res Inst, Sch Comp Sci & Engn, State Key Lab Complex & Crit Software Environm, Beijing 100191, Peoples R China
[2] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
Keywords
Multi-frame depth estimation; Visual-inertial fusion; Transformer; ODOMETRY; STEREO; MOTION; ROBUST; AWARE;
DOI
10.1016/j.inffus.2024.102363
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised monocular depth estimation has been a popular topic since it does not require labor-intensive collection of ground-truth depth. However, the accuracy of a monocular network is limited, as it can only exploit the context of a single image and ignores the geometric cues residing in videos. Most recently, multi-frame depth networks have been introduced into the self-supervised depth learning framework to improve monocular depth; they explicitly encode geometric information via pairwise cost volume construction. In this paper, we address two main issues that affect cost volume construction and thus multi-frame depth estimation. First, camera pose estimation, which determines the epipolar geometry in cost volume construction but has rarely been addressed, is enhanced with an additional inertial modality. The complementary visual and inertial modalities are fused adaptively to provide accurate camera poses with a novel visual-inertial fusion Transformer, in which self-attention handles visual-inertial feature interaction and cross-attention is used for task feature decoding and pose regression. Second, the monocular depth prior, which contains contextual information about the scene, is introduced into multi-frame cost volume aggregation at the feature level. A novel monocular-guided cost volume excitation module is proposed to adaptively modulate cost volume features and resolve possible matching ambiguity. With the proposed modules, we present a self-supervised multi-frame depth estimation network consisting of a monocular depth branch serving as the prior, a camera pose branch integrating both visual and inertial modalities, and a multi-frame depth branch producing the final depth with the aid of the former two branches. Experimental results on the KITTI dataset show that the proposed method achieves a notable performance boost in multi-frame depth estimation over state-of-the-art competitors.
Compared with ManyDepth and MOVEDepth, our method relatively improves depth accuracy by 9.2% and 5.3%, respectively, on the KITTI dataset.
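The record gives no implementation details of the visual-inertial fusion Transformer; as a rough illustration of the attention flow the abstract describes (self-attention over joint visual-inertial tokens, then cross-attention from a pose query to regress a 6-DoF pose), here is a minimal single-head NumPy sketch. All shapes, token counts, and the single-head, projection-free formulation are assumptions, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def visual_inertial_pose(vis_tokens, imu_tokens, pose_query, w_out):
    # 1) self-attention over the concatenated visual and inertial tokens
    #    lets the two modalities interact and fuse adaptively
    tokens = np.concatenate([vis_tokens, imu_tokens], axis=0)
    fused = attention(tokens, tokens, tokens)
    # 2) cross-attention: a learned pose query attends to the fused features
    #    (task feature decoding, as in DETR-style decoders)
    decoded = attention(pose_query, fused, fused)
    # 3) regress a 6-DoF relative pose (3 translation + 3 axis-angle rotation)
    return decoded @ w_out

rng = np.random.default_rng(0)
d = 16
pose = visual_inertial_pose(
    rng.normal(size=(8, d)),   # 8 visual feature tokens (assumed)
    rng.normal(size=(4, d)),   # 4 IMU feature tokens (assumed)
    rng.normal(size=(1, d)),   # learned pose query
    rng.normal(size=(d, 6)),   # pose regression head weights
)
print(pose.shape)  # (1, 6)
```

In practice each attention step would use learned Q/K/V projections, multiple heads, and residual/norm layers; the sketch keeps only the data flow that distinguishes the fusion stage (self-attention) from the decoding stage (cross-attention).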
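The monocular-guided cost volume excitation can likewise be illustrated with a hedged sketch: features from the monocular depth branch produce per-hypothesis gates that modulate the multi-frame cost volume, in the spirit of guided cost volume excitation in stereo matching (Correlate-and-Excite). The gating form, tensor shapes, and the linear guide weights `w_guide` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mono_guided_excitation(cost_volume, mono_feat, w_guide):
    # cost_volume: (D, H, W) matching costs over D depth hypotheses
    # mono_feat:   (C, H, W) features from the monocular depth branch
    # w_guide:     (D, C)    assumed linear map from prior features to gates
    # Gates derived from the monocular prior rescale each depth hypothesis,
    # suppressing ambiguous matches where context contradicts the cost.
    gate = sigmoid(np.einsum('dc,chw->dhw', w_guide, mono_feat))
    return gate * cost_volume

rng = np.random.default_rng(1)
D, C, H, W = 32, 8, 4, 4
out = mono_guided_excitation(
    rng.normal(size=(D, H, W)),
    rng.normal(size=(C, H, W)),
    rng.normal(size=(D, C)),
)
print(out.shape)  # (32, 4, 4)
```

A real implementation would compute the gates with convolutions at matching feature resolution and apply the excitation inside the cost aggregation network, not on raw costs.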
Pages: 13