MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

被引:0
作者
Dong, Shaocong [1 ]
Ding, Lihe [1 ]
Wang, Haiyang [2 ]
Xu, Tingfa [1 ]
Xu, Xinli [1 ]
Bian, Ziyang [1 ]
Wang, Ying [1 ]
Wang, Jie [1 ]
Li, Jianan [1 ]
机构
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
来源
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022年
基金
中国国家自然科学基金;
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
3D object detection from the LiDAR point cloud is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, thus requiring features rich in long-range and fine-grained information to support accurate detection. Recent detectors leverage the power of window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To mitigate this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which can well capture both types of information simultaneously by the divide-and-conquer philosophy. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. All groups' output is merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Endowed by the powerful capability and high efficiency of modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
引用
收藏
页数:14
相关论文
共 63 条
[11]  
He Chenhang, 2022, VOXEL SET TRANSFORME
[12]   Deep Residual Learning for Image Recognition [J].
He, Kaiming ;
Zhang, Xiangyu ;
Ren, Shaoqing ;
Sun, Jian .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :770-778
[13]   Densely Connected Convolutional Networks [J].
Huang, Gao ;
Liu, Zhuang ;
van der Maaten, Laurens ;
Weinberger, Kilian Q. .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :2261-2269
[14]  
King DB, 2015, ACS SYM SER, V1214, P1, DOI 10.1021/bk-2015-1214.ch001
[15]  
Ku J, 2018, IEEE INT C INT ROBOT, P5750, DOI 10.1109/IROS.2018.8594049
[16]   Voxel-FPN: Multi-Scale Voxel Feature Aggregation for 3D Object Detection from LIDAR Point Clouds [J].
Kuang, Hongwu ;
Wang, Bei ;
An, Jianping ;
Zhang, Ming ;
Zhang, Zehan .
SENSORS, 2020, 20 (03)
[17]   PointPillars: Fast Encoders for Object Detection from Point Clouds [J].
Lang, Alex H. ;
Vora, Sourabh ;
Caesar, Holger ;
Zhou, Lubing ;
Yang, Jiong ;
Beijbom, Oscar .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :12689-12697
[18]  
Li B, 2016, ROBOTICS: SCIENCE AND SYSTEMS XII
[19]  
Li Yanghao, 2021, IMPROVED MULTISCALE
[20]   Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes [J].
Li, Zhengqi ;
Niklaus, Simon ;
Snavely, Noah ;
Wang, Oliver .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :6494-6504