SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection From Multi-view Camera Images With Global Cross-Sensor Attention

Cited by: 10
Authors
Doll, Simon [1, 3]
Schulz, Richard [1]
Schneider, Lukas [1]
Benzin, Viviane [1]
Enzweiler, Markus [2]
Lensch, Hendrik P. A. [3]
Affiliations
[1] Mercedes Benz, Stuttgart, Germany
[2] Esslingen Univ Appl Sci, Stuttgart, Germany
[3] Univ Tubingen, Tubingen, Germany
Source
Keywords
3D object detection; Cross-sensor attention; Autonomous driving;
DOI
10.1007/978-3-031-19842-7_14
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Based on the key idea of DETR, this paper introduces an object-centric 3D object detection framework that operates on a limited number of 3D object queries instead of dense bounding box proposals followed by non-maximum suppression. After image feature extraction, a decoder-only transformer architecture is trained on a set-based loss. SpatialDETR infers the classification and bounding box estimates based on attention both spatially within each image and across the different views. To fuse the multi-view information in the attention block, we introduce a novel geometric positional encoding that incorporates the view ray geometry to explicitly consider the extrinsic and intrinsic camera setup. This way, the spatially-aware cross-view attention exploits arbitrary receptive fields to integrate cross-sensor data and therefore global context. Extensive experiments on the nuScenes benchmark demonstrate the potential of global attention and result in state-of-the-art performance. Code is available at https://github.com/cgtuebingen/SpatialDETR.
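The view ray geometry mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that); it only shows how a pixel's view ray could be derived from a camera's intrinsic and extrinsic parameters, which is the kind of geometric quantity a positional encoding might be built on. The function name `view_ray`, the skew-free pinhole model, and the example focal length and principal point are all assumptions made for illustration.

```python
import math


def view_ray(u, v, fx, fy, cx, cy, R):
    """Return the unit view ray of pixel (u, v) in a shared reference frame.

    Assumptions (hypothetical, not taken from the paper): a skew-free
    pinhole camera with focal lengths fx, fy and principal point (cx, cy),
    and a 3x3 rotation R (nested lists) mapping camera coordinates to the
    reference frame.
    """
    # Back-project the pixel with the inverse intrinsics (depth fixed to 1).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the shared reference frame (extrinsics).
    d_ref = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Normalize so that only the direction, not the depth, is kept.
    norm = math.sqrt(sum(c * c for c in d_ref))
    return [c / norm for c in d_ref]


# Example: with an identity rotation, the principal point looks along +z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center_ray = view_ray(800.0, 450.0, 1260.0, 1260.0, 800.0, 450.0, IDENTITY)
```

Because every camera's rays are expressed in the same reference frame, pixels from different views that look in similar directions receive similar ray vectors, which is one plausible reason such rays are useful for attention across views.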
Pages: 230-245
Page count: 16