MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Cited by: 0
Authors
Dong, Shaocong [1 ]
Ding, Lihe [1 ]
Wang, Haiyang [2 ]
Xu, Tingfa [1 ]
Xu, Xinli [1 ]
Bian, Ziyang [1 ]
Wang, Ying [1 ]
Wang, Jie [1 ]
Li, Jianan [1 ]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes typically exhibit significant variance in instance scale, so accurate detection requires features rich in both long-range and fine-grained information. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer strategy. Specifically, MsSVT explicitly divides attention heads into multiple groups, each responsible for attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we propose a novel chessboard sampling strategy that reduces the computational cost of applying a window-based transformer in 3D voxel space. To further improve efficiency, we implement the voxel sampling and gathering operations sparsely using a hash map. Endowed with this powerful yet efficient capability for modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
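The grouped-head idea described above can be sketched in a few lines. The following is a minimal illustrative toy, not the paper's implementation: it assumes a dense per-query neighborhood search with Chebyshev-distance windows and a simple channel split across head groups, whereas MsSVT itself uses chessboard sampling and hash-based sparse gathering for efficiency.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_scale_attention(feats, coords, heads_per_group=2, windows=(1, 3)):
    """Toy mixed-scale attention over sparse voxels (illustrative sketch).

    Each head group attends only to voxels within its own window radius
    (Chebyshev distance in voxel coordinates); the group outputs are then
    concatenated channel-wise to form mixed-scale features.
    """
    n, c = feats.shape
    num_groups = len(windows)
    d = c // (num_groups * heads_per_group)  # per-head channel dim
    assert d * num_groups * heads_per_group == c
    out = np.zeros_like(feats)
    for g, r in enumerate(windows):          # one window radius per group
        for h in range(heads_per_group):
            s = (g * heads_per_group + h) * d  # channel slice for this head
            q = feats[:, s:s + d]
            for i in range(n):
                # neighbors within this group's window define its range
                mask = np.all(np.abs(coords - coords[i]) <= r, axis=1)
                k = feats[mask, s:s + d]
                w = softmax(q[i] @ k.T / np.sqrt(d))
                out[i, s:s + d] = w @ k
    return out
```

Because a small-window group sees only nearby voxels while a large-window group aggregates over a wider neighborhood, the concatenated output mixes fine-grained and long-range information per voxel, which is the core intuition behind the head-group division.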
Pages: 14