Instance-aware sampling and voxel-transformer encoding for single-stage 3D object detection

Times Cited: 0
Authors
Wang, Baotong [1 ]
Xia, Chenxing [1 ]
Gao, Xiuju [2 ]
Yang, Yuan [3 ]
Li, Kuan-Ching [4 ]
Fang, Xianjin [1 ]
Zhang, Yan [5 ]
Ge, Sijia [6 ]
Affiliations
[1] Anhui Univ Sci & Technol, Coll Comp Sci & Engn, Huainan 232001, Peoples R China
[2] Anhui Univ Sci & Technol, Coll Elect & Informat Engn, Huainan 232001, Peoples R China
[3] Anhui Univ Sci & Technol, Sch Math & Big Data, Huainan 232001, Peoples R China
[4] Providence Univ, Dept Comp Sci & Informat Engn, Taichung 43301, Taiwan
[5] Anhui Univ, Sch Elect & Informat Engn, Hefei 230039, Peoples R China
[6] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei 230039, Peoples R China
Keywords
Collaborative enhancement; Dual-channel; Object detection; Point cloud; Weighted sampling
DOI
10.1016/j.dsp.2025.105171
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
In point cloud 3D object detection, single-stage detectors offer fast inference but are less accurate than two-stage detectors. We identify two main problems: first, traditional methods process the entire point cloud and are therefore vulnerable to background noise; second, existing methods have insufficient single-channel feature encoding capability. This paper therefore proposes Instance-Aware Sampling and Voxel-Transformer Encoding for Single-Stage 3D Object Detection (IAVT-SSD). Specifically, we design an Instance-Aware Weighted Sampling Strategy to filter out ground reflection points, enhancing the model's focus on foreground points. Meanwhile, we introduce a Voxel-Transformer Dual-Channel Feature Encoding Module that captures more comprehensive features through two independent channels, efficiently fusing non-empty voxel features with long-range context information. In addition, a Collaborative Enhancement Branch is designed to predict the complete structure of each object. Experiments show that IAVT-SSD achieves a good balance of accuracy and speed, with an inference speed of 42 FPS (frames per second), a mAP (mean average precision) of 81.70% on the KITTI dataset, and a mAP of 66.96% on the ONCE dataset, validating its effectiveness and superiority.
Pages: 15
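
The abstract describes an instance-aware weighted sampling strategy that down-weights ground reflection points so the sampled subset concentrates on foreground points, but the record gives no implementation detail. The Python sketch below illustrates the sampling mechanics only, under stated assumptions: the function name instance_aware_weighted_sample, the height-based ground heuristic, and the ground_z, ground_margin, and fg_weight parameters are illustrative assumptions, not the actual IAVT-SSD method.

# Hypothetical sketch of instance-aware weighted sampling for a LiDAR point
# cloud (N x 4: x, y, z, intensity). Points likely belonging to the ground
# plane are down-weighted so that foreground (object) candidates dominate the
# sampled subset. Thresholds and the scoring heuristic are assumptions for
# illustration; the paper's actual strategy may differ.
import numpy as np

def instance_aware_weighted_sample(points, num_samples, ground_z=-1.5,
                                   ground_margin=0.3, fg_weight=10.0, rng=None):
    """Sample num_samples points, favoring likely-foreground points.

    points: (N, 4) array of x, y, z, intensity.
    ground_z: assumed ground-plane height in the LiDAR frame (KITTI-like).
    ground_margin: points within this height of the ground are treated as
        probable ground-reflection points and down-weighted.
    fg_weight: relative weight given to non-ground (foreground candidate) points.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = points[:, 2]
    is_ground = z < (ground_z + ground_margin)
    weights = np.where(is_ground, 1.0, fg_weight)   # keep a few ground points
    probs = weights / weights.sum()
    idx = rng.choice(len(points), size=num_samples,
                     replace=len(points) < num_samples, p=probs)
    return points[idx]

if __name__ == "__main__":
    # Synthetic cloud: random xyz/intensity with heights spanning ground to 2 m.
    pts = np.random.randn(20000, 4).astype(np.float32)
    pts[:, 2] = np.random.uniform(-1.8, 2.0, size=20000)
    sampled = instance_aware_weighted_sample(pts, 4096)
    print(sampled.shape)  # (4096, 4)

In the paper's setting, the per-point weights would presumably come from a learned foreground/instance score rather than a fixed height threshold; the sketch only shows how such weights translate into a biased sampling of the cloud.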