Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

Cited by: 56
Authors
Harley, Adam W. [1 ]
Fang, Zhaoyuan [2 ]
Li, Jie [3 ]
Ambrus, Rares [3 ]
Fragkiadaki, Katerina [2 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Toyota Res Inst, Los Altos, CA USA
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA | 2023
Keywords
VIEW;
DOI
10.1109/ICRA48891.2023.10160831
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We also observe that using cameras alone is not a real-world constraint, considering that additional sensors like radar have been integrated into real vehicles for years already. In this paper, we first attempt to elucidate the high-impact factors in the design and training protocol of BEV perception models. We find that batch size and input resolution greatly affect performance, while lifting strategies have a more modest effect; even a simple parameter-free lifter works well. Second, we demonstrate that radar data can provide a substantial boost to performance, helping to close the gap between camera-only and LiDAR-enabled systems. We analyze the radar usage details that lead to good performance, and invite the community to reconsider this commonly neglected part of the sensor platform.
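The "parameter-free lifter" mentioned in the abstract works by projecting 3D voxel centers into each camera image and bilinearly sampling features there, then averaging over the cameras that see each voxel. The sketch below illustrates this idea in PyTorch; it is a minimal illustration under stated assumptions, not the authors' released code, and all names and tensor shapes (simple_lift, feats, voxel_xyz, etc.) are hypothetical.

```python
import torch
import torch.nn.functional as F

def simple_lift(feats, intrinsics, extrinsics, voxel_xyz):
    """Parameter-free lifting sketch: project voxel centers into each
    camera and bilinearly sample image features (illustrative only).

    feats:      (N, C, H, W)  per-camera image feature maps
    intrinsics: (N, 3, 3)     camera intrinsic matrices
    extrinsics: (N, 4, 4)     world-to-camera transforms
    voxel_xyz:  (V, 3)        voxel-center coordinates in the world frame
    returns:    (C, V)        voxel features averaged over visible cameras
    """
    N, C, H, W = feats.shape
    V = voxel_xyz.shape[0]
    # Homogeneous world coordinates, shape (4, V)
    ones = torch.ones(V, 1, dtype=voxel_xyz.dtype, device=voxel_xyz.device)
    xyz1 = torch.cat([voxel_xyz, ones], dim=1).T

    acc = torch.zeros(C, V, dtype=feats.dtype, device=feats.device)
    cnt = torch.zeros(1, V, dtype=feats.dtype, device=feats.device)
    for i in range(N):
        cam = extrinsics[i] @ xyz1               # (4, V), camera frame
        z = cam[2].clamp(min=1e-5)               # avoid division by zero
        uv = intrinsics[i] @ cam[:3]             # (3, V), pixel coords * z
        u, v = uv[0] / z, uv[1] / z
        # Normalize pixel coordinates to [-1, 1] for grid_sample
        gu = 2.0 * u / (W - 1) - 1.0
        gv = 2.0 * v / (H - 1) - 1.0
        grid = torch.stack([gu, gv], dim=-1).view(1, 1, V, 2)
        sampled = F.grid_sample(feats[i:i + 1], grid,
                                align_corners=True)  # (1, C, 1, V)
        # Keep only voxels in front of the camera and inside the image
        valid = ((cam[2] > 0) & (gu.abs() <= 1) & (gv.abs() <= 1)).float()
        acc += sampled.view(C, V) * valid
        cnt += valid
    return acc / cnt.clamp(min=1)
```

The point of this design, as the abstract argues, is that the lifting step contributes no learned parameters: model capacity stays in the 2D image backbone and the BEV decoder, and the lifter itself is pure geometry.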
Pages: 2759-2765
Page count: 7