Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

Cited by: 56
Authors
Harley, Adam W. [1 ]
Fang, Zhaoyuan [2 ]
Li, Jie [3 ]
Ambrus, Rares [3 ]
Fragkiadaki, Katerina [2 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Toyota Res Inst, Los Altos, CA USA
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA | 2023
Keywords
VIEW;
DOI
10.1109/ICRA48891.2023.10160831
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We also observe that using cameras alone is not a real-world constraint, considering that additional sensors like radar have been integrated into real vehicles for years already. In this paper, we first attempt to elucidate the high-impact factors in the design and training protocol of BEV perception models. We find that batch size and input resolution greatly affect performance, while lifting strategies have a more modest effect; even a simple parameter-free lifter works well. Second, we demonstrate that radar data can provide a substantial boost to performance, helping to close the gap between camera-only and LiDAR-enabled systems. We analyze the radar usage details that lead to good performance, and invite the community to reconsider this commonly neglected part of the sensor platform.
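The "parameter-free lifter" mentioned in the abstract works by projecting 3D voxel centers into each camera image and bilinearly sampling features there, then averaging over the cameras that see each voxel. The sketch below illustrates this idea in PyTorch; it is a minimal illustration under stated assumptions, not the authors' released code, and all names and tensor shapes (simple_lift, feats, voxel_xyz, etc.) are hypothetical.

```python
import torch
import torch.nn.functional as F

def simple_lift(feats, intrinsics, extrinsics, voxel_xyz):
    """Parameter-free lifting sketch: project voxel centers into each
    camera and bilinearly sample image features (illustrative only).

    feats:      (N, C, H, W)  per-camera image feature maps
    intrinsics: (N, 3, 3)     camera intrinsic matrices
    extrinsics: (N, 4, 4)     world-to-camera transforms
    voxel_xyz:  (V, 3)        voxel-center coordinates in the world frame
    returns:    (C, V)        voxel features averaged over visible cameras
    """
    N, C, H, W = feats.shape
    V = voxel_xyz.shape[0]
    # Homogeneous world coordinates, shape (4, V)
    ones = torch.ones(V, 1, dtype=voxel_xyz.dtype, device=voxel_xyz.device)
    xyz1 = torch.cat([voxel_xyz, ones], dim=1).T

    acc = torch.zeros(C, V, dtype=feats.dtype, device=feats.device)
    cnt = torch.zeros(1, V, dtype=feats.dtype, device=feats.device)
    for i in range(N):
        cam = extrinsics[i] @ xyz1               # (4, V), camera frame
        z = cam[2].clamp(min=1e-5)               # avoid division by zero
        uv = intrinsics[i] @ cam[:3]             # (3, V), pixel coords * z
        u, v = uv[0] / z, uv[1] / z
        # Normalize pixel coordinates to [-1, 1] for grid_sample
        gu = 2.0 * u / (W - 1) - 1.0
        gv = 2.0 * v / (H - 1) - 1.0
        grid = torch.stack([gu, gv], dim=-1).view(1, 1, V, 2)
        sampled = F.grid_sample(feats[i:i + 1], grid,
                                align_corners=True)  # (1, C, 1, V)
        # Keep only voxels in front of the camera and inside the image
        valid = ((cam[2] > 0) & (gu.abs() <= 1) & (gv.abs() <= 1)).float()
        acc += sampled.view(C, V) * valid
        cnt += valid
    return acc / cnt.clamp(min=1)
```

The point of this design, as the abstract argues, is that the lifting step contributes no learned parameters: model capacity stays in the 2D image backbone and the BEV decoder, and the lifter itself is pure geometry.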
Pages: 2759-2765
Page count: 7