MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Cited by: 0
Authors
Dong, Shaocong [1 ]
Ding, Lihe [1 ]
Wang, Haiyang [2 ]
Xu, Tingfa [1 ]
Xu, Xinli [1 ]
Bian, Ziyang [1 ]
Wang, Ying [1 ]
Wang, Jie [1 ]
Li, Jianan [1 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes typically exhibit significant variance in instance scale, so accurate detection requires features rich in both long-range and fine-grained information. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer strategy. Specifically, MsSVT explicitly divides attention heads into multiple groups, each responsible for attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we propose a novel chessboard sampling strategy that reduces the computational cost of applying a window-based transformer in 3D voxel space. To further improve efficiency, we implement the voxel sampling and gathering operations sparsely using a hash map. Endowed with this powerful yet efficient capability for modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
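The grouped-head idea described above can be sketched in a few lines. The following is a minimal illustrative toy, not the paper's implementation: it assumes a dense per-query neighborhood search with Chebyshev-distance windows and a simple channel split across head groups, whereas MsSVT itself uses chessboard sampling and hash-based sparse gathering for efficiency.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_scale_attention(feats, coords, heads_per_group=2, windows=(1, 3)):
    """Toy mixed-scale attention over sparse voxels (illustrative sketch).

    Each head group attends only to voxels within its own window radius
    (Chebyshev distance in voxel coordinates); the group outputs are then
    concatenated channel-wise to form mixed-scale features.
    """
    n, c = feats.shape
    num_groups = len(windows)
    d = c // (num_groups * heads_per_group)  # per-head channel dim
    assert d * num_groups * heads_per_group == c
    out = np.zeros_like(feats)
    for g, r in enumerate(windows):          # one window radius per group
        for h in range(heads_per_group):
            s = (g * heads_per_group + h) * d  # channel slice for this head
            q = feats[:, s:s + d]
            for i in range(n):
                # neighbors within this group's window define its range
                mask = np.all(np.abs(coords - coords[i]) <= r, axis=1)
                k = feats[mask, s:s + d]
                w = softmax(q[i] @ k.T / np.sqrt(d))
                out[i, s:s + d] = w @ k
    return out
```

Because a small-window group sees only nearby voxels while a large-window group aggregates over a wider neighborhood, the concatenated output mixes fine-grained and long-range information per voxel, which is the core intuition behind the head-group division.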
Pages: 14