SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection From Multi-view Camera Images With Global Cross-Sensor Attention

Cited by: 10
Authors
Doll, Simon [1, 3]
Schulz, Richard [1]
Schneider, Lukas [1]
Benzin, Viviane [1]
Enzweiler, Markus [2]
Lensch, Hendrik P. A. [3]
Affiliations
[1] Mercedes Benz, Stuttgart, Germany
[2] Esslingen Univ Appl Sci, Stuttgart, Germany
[3] Univ Tubingen, Tubingen, Germany
Source
COMPUTER VISION, ECCV 2022, PT XXXIX | 2022 / Vol. 13699
Keywords
3D object detection; Cross-sensor attention; Autonomous driving
DOI
10.1007/978-3-031-19842-7_14
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Based on the key idea of DETR, this paper introduces an object-centric 3D object detection framework that operates on a limited number of 3D object queries instead of dense bounding box proposals followed by non-maximum suppression. After image feature extraction, a decoder-only transformer architecture is trained on a set-based loss. SpatialDETR infers the classification and bounding box estimates based on attention both spatially within each image and across the different views. To fuse the multi-view information in the attention block, we introduce a novel geometric positional encoding that incorporates the view ray geometry to explicitly consider the extrinsic and intrinsic camera setup. This way, the spatially-aware cross-view attention exploits arbitrary receptive fields to integrate cross-sensor data and therefore global context. Extensive experiments on the nuScenes benchmark demonstrate the potential of global attention and result in state-of-the-art performance. Code is available at https://github.com/cgtuebingen/SpatialDETR.
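The view ray geometry mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the standard pinhole back-projection that turns a pixel into a unit direction in a shared reference frame, using the camera intrinsics and the rotation part of the extrinsics. The function name `view_ray`, the distortion-free pinhole model, and the numeric values are assumptions made for illustration.

```python
import math


def view_ray(u, v, fx, fy, cx, cy, R):
    """Return the unit view ray of pixel (u, v) in the reference frame.

    fx, fy, cx, cy: pinhole intrinsics (assumed: no skew, no distortion).
    R: 3x3 camera-to-reference rotation as nested lists (assumed convention).
    """
    # Back-project the pixel through the inverse intrinsics: K^-1 [u, v, 1]^T.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction from the camera frame into the reference frame.
    d_ref = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Normalize so that only the direction of the ray is kept.
    norm = math.sqrt(sum(c * c for c in d_ref))
    return [c / norm for c in d_ref]


# Hypothetical camera: identity rotation, principal point at (800, 450).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center_ray = view_ray(800, 450, 1260.0, 1260.0, 800.0, 450.0, IDENTITY)
```

With these assumed values, the pixel at the principal point maps to the optical axis of the camera. Because each ray is expressed in a common frame, directions from different cameras become directly comparable, which is the property a cross-view positional encoding would rely on.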
Pages: 230-245
Page count: 16
Related papers
33 in total
  • [1] Caesar H, 2020, PROC CVPR IEEE, P11618, DOI 10.1109/CVPR42600.2020.01164
  • [2] Carion Nicolas, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12346), P213, DOI 10.1007/978-3-030-58452-8_13
  • [3] Contributors M., 2020, MMDetection3D: OpenMMLab next-generation platform for general 3D object detection
  • [4] Dosovitskiy A, 2021, Arxiv, DOI [arXiv:2010.11929, 10.48550/arXiv.2010.11929]
  • [5] Gao P., 2021, P IEEE CVF C COMP VI, P3621
  • [6] github, DETR3D GITH REP
  • [7] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian, 2016, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), P770-778
  • [8] Huang JJ, 2022, Arxiv, DOI arXiv:2112.11790
  • [9] Jaegle A, 2021, PR MACH LEARN RES, V139
  • [10] Lang, Alex H.; Vora, Sourabh; Caesar, Holger; Zhou, Lubing; Yang, Jiong; Beijbom, Oscar, 2019, PointPillars: Fast Encoders for Object Detection from Point Clouds, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), P12689-12697