Unsupervised framework for depth estimation and camera motion prediction from video

Cited by: 13
Authors
Yang, Delong [1 ]
Zhong, Xunyu [1 ]
Gu, Dongbing [2 ]
Peng, Xiafu [1 ]
Hu, Huosheng [2 ]
Affiliations
[1] Xiamen Univ, Dept Automat, Xiamen 361005, Peoples R China
[2] Univ Essex, Sch Comp Sci & Elect Engn, Colchester CO4 3SQ, Essex, England
Keywords
Unsupervised deep learning; Depth estimation; Camera motion prediction; Convolutional neural network
DOI
10.1016/j.neucom.2019.12.049
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Depth estimation from monocular video plays a crucial role in scene perception. A significant drawback of supervised learning models is their need for vast amounts of manually labeled data (ground truth) for training. To overcome this limitation, unsupervised learning strategies that do not require ground truth have attracted extensive attention from researchers in recent years. This paper presents a novel unsupervised framework that jointly estimates single-view depth and predicts camera motion. Stereo image sequences are used to train the model, while only monocular images are required for inference. The presented framework is composed of two CNNs (a depth CNN and a pose CNN) which are trained concurrently and tested independently. The objective function is constructed on the basis of the epipolar geometry constraints between stereo image sequences. To improve the accuracy of the model, a left-right consistency loss is added to the objective function. The use of stereo image sequences enables us to exploit both the spatial information between stereo images and the temporal photometric warp error from image sequences. Experimental results on the KITTI and Cityscapes datasets show that our model not only outperforms prior unsupervised approaches but also achieves results comparable with several supervised methods. Moreover, we also train our model on the Euroc dataset, which was captured in an indoor environment. Experiments in indoor and outdoor scenes are conducted to test the generalization capability of the model. (C) 2019 Elsevier B.V. All rights reserved.
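The objective summarized above combines a temporal photometric warp term with a spatial left-right consistency term over stereo sequences. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of how such a combined loss could be assembled. The helper warp_with_disparity, the plain L1 photometric measure, the disparity sign conventions, and the weights w_photo/w_lr are all illustrative assumptions, and the warping of frame t+1 into frame t via the predicted depth and camera pose is taken as a precomputed input for brevity.

import torch
import torch.nn.functional as F


def warp_with_disparity(img, disp):
    # Bilinearly sample `img`, shifting each pixel horizontally by the predicted
    # disparity. `img` is (B, C, H, W); `disp` is (B, 1, H, W) expressed in
    # normalized image coordinates (a simplifying assumption for this sketch).
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=img.device),
        torch.linspace(-1, 1, w, device=img.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    grid[..., 0] = grid[..., 0] + disp.squeeze(1)
    return F.grid_sample(img, grid, align_corners=True)


def stereo_training_loss(left_t, right_t, disp_left, disp_right,
                         left_t_from_tp1, w_photo=1.0, w_lr=1.0):
    # `left_t_from_tp1` is frame t+1 warped into frame t using the predicted
    # depth and camera pose (that depth/pose warp itself is omitted here).

    # Temporal photometric reconstruction error (L1 as a stand-in measure).
    temporal_photo = (left_t - left_t_from_tp1).abs().mean()

    # Spatial photometric term: reconstruct the left image from the right one
    # using the predicted left disparity (sign convention is illustrative).
    left_from_right = warp_with_disparity(right_t, -disp_left)
    spatial_photo = (left_t - left_from_right).abs().mean()

    # Left-right consistency: each disparity map, warped into the other view,
    # should agree with the disparity predicted directly for that view.
    disp_r_to_l = warp_with_disparity(disp_right, -disp_left)
    disp_l_to_r = warp_with_disparity(disp_left, disp_right)
    lr_consistency = (disp_left - disp_r_to_l).abs().mean() + \
                     (disp_right - disp_l_to_r).abs().mean()

    return w_photo * (temporal_photo + spatial_photo) + w_lr * lr_consistency

In a training step under these assumptions, the depth CNN would produce disp_left and disp_right from the stereo pair at time t, the pose CNN's output would be used to warp frame t+1 into frame t, and the combined loss would be backpropagated through both networks, which can then be run independently at test time.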
Pages: 169-185
Number of pages: 17