SSD-MonoDETR: Supervised Scale-Aware Deformable Transformer for Monocular 3D Object Detection

Cited by: 4
Authors
He, Xuan [1 ,2 ]
Yang, Fan [1 ,2 ]
Yang, Kailun [3 ,4 ]
Lin, Jiacheng [1 ,2 ]
Fu, Haolong [1 ,2 ]
Wang, Meng [5 ]
Yuan, Jin [1 ,2 ]
Li, Zhiyong [2 ,6 ]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[2] Hunan Univ, Sch Robot, Changsha 410012, Peoples R China
[3] Hunan Univ, Sch Robot, Changsha 410012, Peoples R China
[4] Hunan Univ, Natl Engn Res Ctr Robot Visual Percept & Control, Changsha 410082, Peoples R China
[5] Hefei Univ Technol, Sch Comp Sci, Hefei 230009, Peoples R China
[6] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
Source
IEEE TRANSACTIONS ON INTELLIGENT VEHICLES | 2024 / Vol. 9 / No. 01
Funding
National Natural Science Foundation of China;
Keywords
Three-dimensional displays; Transformers; Object detection; Feature extraction; Visualization; Decoding; Weighted sum model; Autonomous driving; monocular 3D object detection; scene understanding; vision transformer;
DOI
10.1109/TIV.2023.3311949
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based methods have recently demonstrated superior performance for monocular 3D object detection, which aims to predict 3D attributes from a single 2D image. Most existing transformer-based methods leverage both visual and depth representations to explore valuable query points on objects, and the quality of the learned query points has a great impact on detection accuracy. Unfortunately, existing unsupervised attention mechanisms in transformers are prone to generating low-quality query features due to inaccurate receptive fields, especially on hard objects. To tackle this problem, this article proposes a novel "Supervised Scale-aware Deformable Attention" (SSDA) for monocular 3D object detection. Specifically, SSDA presets several masks with different scales and utilizes depth and visual features to adaptively learn a scale-aware filter for object query augmentation. By imposing scale awareness, SSDA can predict an accurate receptive field for an object query to support robust query feature generation. In addition, SSDA is assigned a Weighted Scale Matching (WSM) loss to supervise scale prediction, which yields more confident results than unsupervised attention mechanisms. Extensive experiments on the KITTI and Waymo Open datasets demonstrate that SSDA significantly improves detection accuracy, especially on moderate and hard objects, yielding state-of-the-art performance compared with existing approaches.
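The abstract's idea of preset multi-scale masks combined through a learned scale-aware filter can be illustrated with a toy sketch. This is not the authors' implementation: the square windows, the single-channel feature map, the function names (`masked_mean`, `scale_aware_query`, `scale_cross_entropy`), and the externally supplied `scale_logits` are all assumptions made for illustration. In the paper the scale weights are learned from depth and visual features, and the supervision is the WSM loss, whose exact form is not given in this record; a plain cross-entropy on the scale weights stands in for it here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_mean(feature_map, cy, cx, radius):
    """Mean of a single-channel map inside a square mask of the given
    radius centred at (cy, cx), clipped to the map bounds.
    (Square masks are an assumption of this sketch.)"""
    h, w = len(feature_map), len(feature_map[0])
    total, count = 0.0, 0
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            total += feature_map[y][x]
            count += 1
    return total / count


def scale_aware_query(feature_map, cy, cx, radii, scale_logits):
    """Pool the map under each preset mask, then blend the pooled values
    with softmax weights over scales. The logits are passed in directly
    here; a real model would predict them from depth and visual features."""
    weights = softmax(scale_logits)
    pooled = [masked_mean(feature_map, cy, cx, r) for r in radii]
    query = sum(wt * p for wt, p in zip(weights, pooled))
    return query, weights


def scale_cross_entropy(weights, target_index):
    """Stand-in supervision: cross-entropy between the predicted scale
    weights and a target scale index. NOT the paper's WSM loss."""
    return -math.log(weights[target_index])


if __name__ == "__main__":
    fmap = [[float(y * 5 + x) for x in range(5)] for y in range(5)]
    query, weights = scale_aware_query(fmap, 2, 2, [0, 1, 2], [0.3, -1.0, 2.0])
    print(query, weights, scale_cross_entropy(weights, 2))
```

The point of the sketch is only the structure: several fixed receptive fields are evaluated per query, and a supervised signal on the scale weights pushes the model toward the receptive field that matches the object.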
Pages: 555-567
Number of pages: 13