Semi-Supervised Adversarial Monocular Depth Estimation

Cited by: 35

Authors:

Ji, Rongrong [1]

Li, Ke [2]

Wang, Yan [3]

Sun, Xiaoshuai [1]

Guo, Feng [1]

Guo, Xiaowei [2]

Wu, Yongjian [2]

Huang, Feiyue [2]

Luo, Jiebo [4]

Affiliations:

[1] Xiamen Univ, Dept Artificial Intelligence, Media Analyt & Comp Lab, Sch Informat, Xiamen 361005, Peoples R China

[2] Tencent Youtu Lab, Shanghai, Peoples R China

[3] Microsoft, Redmond, WA 98052 USA

[4] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA

Source:

IEEE Transactions on Pattern Analysis and Machine Intelligence | 2020 / Vol. 42 / No. 10

Keywords:

Estimation; Generators; Training; Image reconstruction; Sensors; Adaptation models; Data models; Monocular depth estimation; generative adversarial learning; semi-supervised learning; SHAPE

DOI:

10.1109/TPAMI.2019.2936024

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we address the problem of monocular depth estimation when only a limited number of training image-depth pairs are available. To achieve high regression accuracy, state-of-the-art estimation methods rely on CNNs trained with a large number of image-depth pairs, which are prohibitively costly or even infeasible to acquire. Aiming to break the curse of such expensive data collection, we propose a semi-supervised adversarial learning framework that utilizes only a small number of image-depth pairs in conjunction with a large number of easily available monocular images to achieve high performance. In particular, we use one generator to regress the depth and two discriminators to evaluate the predicted depth: one inspects the image-depth pair, while the other inspects the depth channel alone. These two discriminators provide their feedback to the generator as a loss, pushing it to generate more realistic and accurate depth predictions. Experiments show that the proposed approach can (1) improve most state-of-the-art models on the NYUD v2 dataset by effectively leveraging additional unlabeled data sources; (2) reach state-of-the-art accuracy when the training set is small, e.g., on the Make3D dataset; (3) adapt well to an unseen new dataset (Make3D in our case) after training on an annotated dataset (KITTI in our case).
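The one-generator, two-discriminator arrangement described in the abstract can be illustrated with a minimal sketch of how the generator's loss might be assembled. The abstract does not give the loss formulas, so everything specific here is an assumption made for illustration: the binary cross-entropy adversarial terms, the L1 supervised term, the weights `w_pair` and `w_depth`, and the names `generator_loss`, `d_pair`, and `d_depth` are all hypothetical and are not taken from the paper.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability (clamped away from 0 and 1)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def generator_loss(pred_depth, image, d_pair, d_depth, gt_depth=None,
                   w_pair=1.0, w_depth=1.0):
    """Hypothetical generator loss combining feedback from two discriminators.

    pred_depth, image, gt_depth: flat lists of floats (toy stand-ins for tensors).
    d_pair(image, depth) -> probability that the image-depth pair is real.
    d_depth(depth)       -> probability that the depth map alone is real.
    """
    # Adversarial terms: the generator wants both discriminators to say "real" (1).
    loss = w_pair * bce(d_pair(image, pred_depth), 1.0)
    loss += w_depth * bce(d_depth(pred_depth), 1.0)
    # Supervised term (assumed L1), applied only when ground-truth depth exists.
    if gt_depth is not None:
        loss += sum(abs(p - g) for p, g in zip(pred_depth, gt_depth)) / len(pred_depth)
    return loss


if __name__ == "__main__":
    # Stub discriminators that are maximally unsure (always output 0.5).
    unsure_pair = lambda image, depth: 0.5
    unsure_depth = lambda depth: 0.5
    image = [0.2, 0.4]
    pred = [1.0, 2.0]
    print("unlabeled image:", generator_loss(pred, image, unsure_pair, unsure_depth))
    print("labeled pair:   ", generator_loss(pred, image, unsure_pair, unsure_depth,
                                             gt_depth=[1.0, 3.0]))
```

In this sketch an unlabeled image contributes only the two adversarial terms, while a labeled image-depth pair adds the supervised term, which is one plausible way a small labeled set and a large unlabeled set could share a single training objective.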

Pages: 2410-2422

Page count: 13
