MonoPSTR: Monocular 3-D Object Detection With Dynamic Position and Scale-Aware Transformer

Cited by: 0
Authors
Yang, Fan [1 ]
He, Xuan [2 ]
Chen, Wenrui [1 ,3 ]
Zhou, Pengjie [2 ]
Li, Zhiyong [2 ,3 ]
Affiliations
[1] Hunan Univ, Sch Robot, Changsha 410012, Peoples R China
[2] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
[3] Hunan Univ, Natl Engn Res Ctr Robot Visual Percept & Control T, Changsha 410082, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Three-dimensional displays; Transformers; Object detection; Decoding; Training; Accuracy; Feature extraction; Autonomous driving; monocular 3-D object detection; robotics; scene understanding; transformer;
DOI
10.1109/TIM.2024.3415231
CLC Classification Number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Transformer-based approaches have demonstrated outstanding performance in monocular 3-D object detection, which predicts 3-D attributes from a single 2-D image. These methods typically rely on visual and depth representations to identify the queries most relevant to objects. However, the features and locations of the queries are learned adaptively without any prior knowledge, which often leads to imprecise localization in complex scenes and a long training process. To overcome this limitation, we present MonoPSTR, which employs a dynamic position- and scale-aware transformer for monocular 3-D detection. Our approach introduces a dynamically and explicitly position-coded query (DEP-query) and a scale-assisted deformable attention (SDA) module to endow the raw queries with valuable spatial and content cues. Specifically, the DEP-query employs explicit position priors from projected 3-D coordinates to improve the accuracy of query localization, enabling the attention layers in the decoder to avoid noisy background information. The SDA module optimizes the receptive-field learning of the queries using the size priors of the corresponding 2-D boxes, so the queries can acquire high-quality visual features. Neither the position priors nor the size priors require any additional data, and both are updated in each decoder layer to provide long-term assistance. Extensive experiments show that our model outperforms all existing methods in inference speed, reaching 62.5 frames/s. Moreover, compared with its backbone, MonoDETR, MonoPSTR converges roughly twice as fast during training and surpasses its precision by over 1.15% on the KITTI dataset, demonstrating its practical value. The code is available at: https://github.com/yangfan293/MonoPSTR/tree/master/MonoPSTR.
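The two priors described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: `sine_pos_embed` stands in for an explicit position code built from a query's projected (normalized) 2-D center, and `scale_aware_offsets` shows how learned deformable-attention sampling offsets could be modulated by a 2-D box size prior so that each query's receptive field roughly matches its object's extent. All function names and shapes here are hypothetical.

```python
import numpy as np

def sine_pos_embed(xy, dim=8, temperature=20.0):
    """Sinusoidal embedding of a projected 2-D center in [0, 1]^2.

    A stand-in for an explicit position prior attached to a query:
    each coordinate is expanded into sin/cos pairs at several frequencies.
    """
    n_freq = dim // 4                      # sin+cos per frequency, per coordinate
    freqs = temperature ** (np.arange(n_freq) / n_freq)
    parts = []
    for coord in xy:                       # x first, then y
        angles = coord / freqs
        parts.append(np.sin(angles))
        parts.append(np.cos(angles))
    return np.concatenate(parts)           # shape (dim,)

def scale_aware_offsets(raw_offsets, box_wh):
    """Scale learned sampling offsets by half the 2-D box size prior.

    raw_offsets: (n_points, 2) offsets in a unit-box frame.
    box_wh: (width, height) of the query's 2-D box.
    Returns offsets stretched to cover the object's actual extent.
    """
    return raw_offsets * (np.asarray(box_wh, dtype=float) / 2.0)

# Toy usage: a query centered at (0.3, 0.7) with a 4x2 box.
pos_code = sine_pos_embed(np.array([0.3, 0.7]))
offsets = scale_aware_offsets(np.array([[1.0, 1.0], [-0.5, 2.0]]), (4.0, 2.0))
```

In this sketch a wide box stretches the sampling pattern horizontally and a tall box stretches it vertically, which is the intuition behind letting 2-D size priors guide where deformable attention samples features.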
Pages: 13