Instance-aware sampling and voxel-transformer encoding for single-stage 3D object detection

Times Cited: 0
Authors
Wang, Baotong [1 ]
Xia, Chenxing [1 ]
Gao, Xiuju [2 ]
Yang, Yuan [3 ]
Li, Kuan-Ching [4 ]
Fang, Xianjin [1 ]
Zhang, Yan [5 ]
Ge, Sijia [6 ]
Affiliations
[1] Anhui Univ Sci & Technol, Coll Comp Sci & Engn, Huainan 232001, Peoples R China
[2] Anhui Univ Sci & Technol, Coll Elect & Informat Engn, Huainan 232001, Peoples R China
[3] Anhui Univ Sci & Technol, Sch Math & Big Data, Huainan 232001, Peoples R China
[4] Providence Univ, Dept Comp Sci & Informat Engn, Taichung 43301, Taiwan
[5] Anhui Univ, Sch Elect & Informat Engn, Hefei 230039, Peoples R China
[6] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei 230039, Peoples R China
Keywords
Collaborative enhancement; Dual-channel; Object detection; Point cloud; Weighted sampling
DOI
10.1016/j.dsp.2025.105171
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
In point cloud 3D object detection, single-stage detectors offer fast inference but are less accurate than two-stage detectors. We identify two main problems: first, traditional methods process the entire point cloud and are therefore vulnerable to background noise; second, existing methods have insufficient single-channel feature encoding capability. This paper therefore proposes Instance-Aware Sampling and Voxel-Transformer Encoding for Single-Stage 3D Object Detection (IAVT-SSD). Specifically, we design an Instance-Aware Weighted Sampling Strategy to filter out ground reflection points, enhancing the model's focus on foreground points. Meanwhile, we introduce a Voxel-Transformer Dual-Channel Feature Encoding Module that captures more comprehensive features through two independent channels, efficiently fusing non-empty voxel features with long-range context information. In addition, a Collaborative Enhancement Branch is designed to predict the complete structure of each object. Experiments show that IAVT-SSD achieves a good balance of accuracy and speed, with an inference speed of 42 FPS (frames per second), a mAP (mean average precision) of 81.70% on the KITTI dataset, and a mAP of 66.96% on the ONCE dataset, validating its effectiveness and superiority.
Pages: 15
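
The abstract describes an instance-aware weighted sampling strategy that down-weights ground reflection points so the sampled subset concentrates on foreground points, but the record gives no implementation detail. The Python sketch below illustrates the sampling mechanics only, under stated assumptions: the function name instance_aware_weighted_sample, the height-based ground heuristic, and the ground_z, ground_margin, and fg_weight parameters are illustrative assumptions, not the actual IAVT-SSD method.

# Hypothetical sketch of instance-aware weighted sampling for a LiDAR point
# cloud (N x 4: x, y, z, intensity). Points likely belonging to the ground
# plane are down-weighted so that foreground (object) candidates dominate the
# sampled subset. Thresholds and the scoring heuristic are assumptions for
# illustration; the paper's actual strategy may differ.
import numpy as np

def instance_aware_weighted_sample(points, num_samples, ground_z=-1.5,
                                   ground_margin=0.3, fg_weight=10.0, rng=None):
    """Sample num_samples points, favoring likely-foreground points.

    points: (N, 4) array of x, y, z, intensity.
    ground_z: assumed ground-plane height in the LiDAR frame (KITTI-like).
    ground_margin: points within this height of the ground are treated as
        probable ground-reflection points and down-weighted.
    fg_weight: relative weight given to non-ground (foreground candidate) points.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = points[:, 2]
    is_ground = z < (ground_z + ground_margin)
    weights = np.where(is_ground, 1.0, fg_weight)   # keep a few ground points
    probs = weights / weights.sum()
    idx = rng.choice(len(points), size=num_samples,
                     replace=len(points) < num_samples, p=probs)
    return points[idx]

if __name__ == "__main__":
    # Synthetic cloud: random xyz/intensity with heights spanning ground to 2 m.
    pts = np.random.randn(20000, 4).astype(np.float32)
    pts[:, 2] = np.random.uniform(-1.8, 2.0, size=20000)
    sampled = instance_aware_weighted_sample(pts, 4096)
    print(sampled.shape)  # (4096, 4)

In the paper's setting, the per-point weights would presumably come from a learned foreground/instance score rather than a fixed height threshold; the sketch only shows how such weights translate into a biased sampling of the cloud.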