Fast-BEV: A Fast and Strong Birds-Eye View Perception Baseline

Cited by: 7
Authors
Li, Yangguang [1 ]
Huang, Bin [2 ]
Chen, Zeren [3 ]
Cui, Yufeng [3 ]
Liang, Feng [4 ]
Shen, Mingzhu [3 ]
Liu, Fenggang [1 ]
Xie, Enze [5 ]
Sheng, Lu [3 ]
Ouyang, Wanli [6 ]
Shao, Jing [6 ]
Affiliations
[1] SenseTime Res, Beijing 100080, Peoples R China
[2] Hozon New Energy Automobile Co Ltd, Shanghai 200062, Peoples R China
[3] Beihang Univ, Beijing 100191, Peoples R China
[4] Univ Texas Austin, Austin, TX 78712 USA
[5] Univ Hong Kong, Pok Fu Lam, Hong Kong, Peoples R China
[6] Shanghai AI Lab, Shanghai 200232, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-camera; bird's-eye view (BEV) representation; autonomous driving;
DOI
10.1109/TPAMI.2024.3414835
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, perception tasks based on the Bird's-Eye View (BEV) representation have drawn increasing attention, and the BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which performs faster BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without an expensive transformer-based transformation or depth representation. Fast-BEV consists of five parts: (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features into 3D voxel space; (2) a multi-scale image encoder that leverages multi-scale information for better performance; (3) an efficient BEV encoder specifically designed to speed up on-vehicle inference; (4) a strong data augmentation strategy in both image and BEV space to avoid over-fitting; and (5) a multi-frame feature fusion mechanism that leverages temporal information. Among them, (1) and (3) make Fast-BEV fast at inference and friendly to deploy on on-vehicle chips, while (2), (4), and (5) ensure competitive performance. Together, these make Fast-BEV a solution offering high performance, fast inference, and easy deployment on the on-vehicle chips of autonomous driving. In experiments on the 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of the BEVDepth-R50 model (Li et al., 2022) and the 30.2 FPS and 45.7% NDS of the BEVDet4D-R50 model (J. Huang and G. Huang, 2022). Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set.
We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips.
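The abstract's key efficiency idea, a view transformation that transfers 2D image features into 3D voxel space without a transformer or per-pixel depth, can be illustrated with a precomputed lookup table: each voxel's projection into the image is computed once from the camera geometry, after which every frame needs only a single gather. This is a minimal single-camera sketch in numpy; all function names, shapes, and the projection setup are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_lut(K, T_cam_from_ego, voxel_centers, img_h, img_w):
    """Precompute, per voxel, the flat image-pixel index it projects to (-1 if none).

    K: (3,3) camera intrinsics; T_cam_from_ego: (4,4) ego-to-camera extrinsics.
    voxel_centers: (N,3) ego-frame voxel centers.
    The returned (N,) LUT depends only on geometry, so it is reusable across frames.
    """
    homo = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (T_cam_from_ego @ homo.T).T[:, :3]              # ego frame -> camera frame
    lut = np.full(len(voxel_centers), -1, dtype=np.int64)
    infront = cam[:, 2] > 1e-3                             # keep points in front of the camera
    uv = (K @ cam[infront].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(np.int64) # perspective divide -> pixel coords
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    idx = np.flatnonzero(infront)[ok]
    lut[idx] = uv[ok, 1] * img_w + uv[ok, 0]               # flat row-major pixel index
    return lut

def view_transform(img_feat, lut):
    """Fill voxel features with one gather; no depth estimation at runtime.

    img_feat: (C, H, W) image feature map; lut: (N,) from build_lut.
    Returns (N, C) voxel features; voxels with no valid projection stay zero.
    """
    C, H, W = img_feat.shape
    flat = img_feat.reshape(C, H * W).T                    # (H*W, C) pixel features
    out = np.zeros((lut.shape[0], C), dtype=img_feat.dtype)
    hit = lut >= 0
    out[hit] = flat[lut[hit]]
    return out
```

Because the geometry is frozen into the table, the per-frame cost is a memory gather rather than a matrix projection, which is what makes this style of transformation attractive for on-vehicle chips.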
Pages: 8665 - 8679
Page count: 15
References
42 entries in total
  • [21] neuvition, 2022, LiDAR price for cars-neuvition: Solid-state LiDAR, LiDAR sensor suppliers, LiDAR technology, LiDAR sensor
  • [22] Is Pseudo-Lidar needed for Monocular 3D Object detection?
    Park, Dennis
    Ambrus, Rares
    Guizilini, Vitor
    Li, Jie
    Gaidon, Adrien
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 3122 - 3132
  • [23] Philion Jonah, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12359), P194, DOI 10.1007/978-3-030-58568-6_12
  • [24] Categorical Depth Distribution Network for Monocular 3D Object Detection
    Reading, Cody
    Harakeh, Ali
    Chae, Julia
    Waslander, Steven L.
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 8551 - 8560
  • [25] Ren S., 2015, P ADV NEUR INF PROC, P3122
  • [26] Roddick T, 2018, Arxiv, DOI arXiv:1811.08188
  • [27] ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection
    Rukhovich, Danila
    Vorontsova, Anna
    Konushin, Anton
    [J]. 2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 1265 - 1274
  • [28] Saha A., 2021, arXiv
  • [29] PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud
    Shi, Shaoshuai
    Wang, Xiaogang
    Li, Hongsheng
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 770 - 779
  • [30] Scalability in Perception for Autonomous Driving: Waymo Open Dataset
    Sun, Pei
    Kretzschmar, Henrik
    Dotiwalla, Xerxes
    Chouard, Aurelien
    Patnaik, Vijaysai
    Tsui, Paul
    Guo, James
    Zhou, Yin
    Chai, Yuning
    Caine, Benjamin
    Vasudevan, Vijay
    Han, Wei
    Ngiam, Jiquan
    Zhao, Hang
    Timofeev, Aleksei
    Ettinger, Scott
    Krivokon, Maxim
    Gao, Amy
    Joshi, Aditya
    Zhang, Yu
    Shlens, Jonathon
    Chen, Zhifeng
    Anguelov, Dragomir
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 2443 - 2451