Monocular Visual Object 3D Localization in Road Scenes

Cited by: 15
Authors
Wang, Yizhou [1]
Huang, Yen-Ting [2]
Hwang, Jenq-Neng [1]
Affiliations
[1] University of Washington, Seattle, WA 98195, USA
[2] National Chengchi University, Pervasive AI Research (PAIR) Labs, Taipei, Taiwan
Source
Proceedings of the 27th ACM International Conference on Multimedia (MM '19) | 2019
Keywords
object localization; monocular depthmap; ground plane estimation; tracklet smoothing; autonomous driving;
DOI
10.1145/3343031.3350924
Chinese Library Classification (CLC) number
TP39 [Applications of Computers]
Subject classification codes
081203; 0835
Abstract
3D localization of objects in road scenes is important for autonomous driving and advanced driver-assistance systems (ADAS). However, with common monocular camera setups, 3D information is difficult to obtain. In this paper, we propose a novel and robust method for 3D localization of monocular visual objects in road scenes by jointly integrating depth estimation, ground plane estimation, and multi-object tracking techniques. First, an object depth estimation method with a depth confidence measure is proposed, utilizing the monocular depthmap produced by a CNN. Second, an adaptive ground plane estimation using both dense and sparse features is proposed to localize objects whose depth estimates are unreliable. Third, temporal information is taken into account by a new object tracklet smoothing method. Unlike most existing methods, which consider only vehicle localization, our method is applicable to common moving objects in road scenes, including pedestrians, vehicles, and cyclists. Moreover, the input depthmap can be replaced by equivalent depth information from other sensors, such as LiDAR, depth cameras, or radar, which makes our system much more competitive with other object localization methods. As evaluated on the KITTI dataset, our method achieves favorable performance on 3D localization of both pedestrians and vehicles compared with state-of-the-art vehicle localization methods, although, to the best of our knowledge, no published results on pedestrian 3D localization are available for comparison.
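To make the ground-plane fallback in the abstract concrete, below is a minimal Python/NumPy sketch (not the authors' code) of the standard ray-plane intersection that ground-plane-based localization builds on: the bottom-center of a detected 2D box is back-projected to a viewing ray and intersected with an estimated ground plane. The function name, intrinsics, and plane parameters are all illustrative assumptions.

```python
import numpy as np

def localize_on_ground_plane(bbox, K, n, d):
    """Intersect the viewing ray through the bottom-center of a 2D
    bounding box with the ground plane n^T X + d = 0 (camera frame).

    bbox : (x1, y1, x2, y2) pixel box from a 2D detector
    K    : 3x3 camera intrinsic matrix
    n    : unit normal of the ground plane in camera coordinates
    d    : plane offset; e.g. n = (0, 1, 0), d = -1.65 for a flat
           road 1.65 m below the camera (y axis points down)
    """
    x1, y1, x2, y2 = bbox
    # The bottom-center pixel approximates where the object meets the road.
    foot_pixel = np.array([(x1 + x2) / 2.0, y2, 1.0])
    # Back-project the pixel to a viewing ray X(t) = t * ray.
    ray = np.linalg.inv(K) @ foot_pixel
    # Solve n^T (t * ray) + d = 0 for the ray-plane intersection depth t.
    t = -d / (n @ ray)
    return t * ray  # 3D foot point of the object in camera coordinates

# Illustrative KITTI-like numbers (not taken from the paper):
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
n, d = np.array([0.0, 1.0, 0.0]), -1.65   # flat road, camera 1.65 m up
print(localize_on_ground_plane((650, 220, 750, 300), K, n, d))
# -> roughly [1.17, 1.65, 9.37]: the object stands about 9.4 m ahead
```

In the paper's pipeline, a geometric estimate of this kind is invoked only when the depthmap-based object depth has low confidence, and the plane parameters themselves come from the adaptive dense/sparse ground plane estimation step rather than being fixed as above.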
Pages: 917-925
Page count: 9