Self-Supervised Monocular Depth Estimation With Positional Shift Depth Variance and Adaptive Disparity Quantization

Cited by: 9

Authors:

Bello, Juan Luis Gonzalez [1]

Moon, Jaeho [1]

Kim, Munchurl [1]

Affiliations:

[1] Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, Daejeon 34141, South Korea

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Keywords:

Depth from videos; self-supervised; monocular depth estimation; deep convolutional neural networks

DOI:

10.1109/TIP.2024.3374045

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, attempts to learn the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion have drawn much attention. One of the most challenging aspects of this task is to handle independently moving objects as they break the rigid-scene assumption. In this paper, we show for the first time that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. The proposed moving object (MO) masks, which are induced by the depth variance to shifted positional information (SPI) and are referred to as 'SPIMO' masks, are highly robust and consistently remove independently moving objects from the scenes, allowing for robust and consistent learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for depth discretization, improving the fine granularity and accuracy of the final aggregated depth maps. Finally, we employ existing boosting techniques in a new way that self-supervises moving object depths further. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with four- to eight-fold fewer parameters than the previous SOTA techniques that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.

Citation

Pages: 2074-2089

Page count: 16

