Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

Cited by: 663
Authors
Philion, Jonah [1 ,2 ,3 ]
Fidler, Sanja [1 ,2 ,3 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95050 USA
[2] Univ Toronto, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
Source
COMPUTER VISION - ECCV 2020, PT XIV | 2020 / Vol. 12359
DOI
10.1007/978-3-030-58568-6_12
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: https://nv-tlabs.github.io/lift-splat-shoot.
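The "lift" and "splat" steps described in the abstract can be illustrated with a minimal numpy sketch. This is an assumption-laden toy version, not the paper's implementation: all sizes are made up, and random grid cells stand in for the camera-geometry projection of frustum points. The core idea shown is the per-pixel outer product of a categorical depth distribution with a context vector, followed by sum-pooling into a bird's-eye-view grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's config).
H, W = 8, 16        # image feature-map height and width
D, C = 4, 5         # number of depth bins, feature channels
X = Y = 10          # bird's-eye-view grid size (cells per side)

# "Lift": for every pixel, predict a categorical distribution alpha
# over D depth bins (softmax) and a context vector; the frustum
# feature at depth d is alpha_d * context (an outer product per pixel).
logits = rng.normal(size=(H, W, D))
alpha = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
context = rng.normal(size=(H, W, C))
frustum = alpha[..., None] * context[:, :, None, :]   # (H, W, D, C)

# Each frustum point has a 3D location given by camera intrinsics and
# extrinsics; random cell indices stand in for that projection here.
cell_x = rng.integers(0, X, size=(H, W, D))
cell_y = rng.integers(0, Y, size=(H, W, D))

# "Splat": sum-pool every frustum feature into its BEV grid cell
# (unbuffered accumulation, so repeated cells add up correctly).
bev = np.zeros((X, Y, C))
np.add.at(bev, (cell_x.ravel(), cell_y.ravel()),
          frustum.reshape(-1, C))
```

Because alpha sums to one over the depth bins, summing the frustum over depth recovers the context features, and sum-pooling conserves the total feature mass moved into the grid; repeating the splat per camera and accumulating into the same `bev` tensor is how a multi-camera rig would fuse into one representation.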
Pages: 194-210
Page count: 17