MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Cited by: 0

Authors:

Dong, Shaocong [1]

Ding, Lihe [1]

Wang, Haiyang [2]

Xu, Tingfa [1]

Xu, Xinli [1]

Bian, Ziyang [1]

Wang, Ying [1]

Wang, Jie [1]

Li, Jianan [1]

Affiliations:

[1] Beijing Institute of Technology, Beijing, People's Republic of China

[2] Peking University, Beijing, People's Republic of China

Source:

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) | 2022

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, thus requiring features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To close this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer philosophy. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Endowed with the powerful capability and high efficiency of modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
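The abstract names three mechanisms: a hash map for sparse voxel gathering, attention-head groups that each look at a different range, and chessboard sampling. The sketch below is only an illustration of how these ideas might fit together, not the authors' implementation. All function names (`build_voxel_hash`, `gather_window`, `mixed_scale_keys`, `chessboard_subset`), the cubic window shape, and the `(x + y) % period` pattern are assumptions made for this example; the paper's actual sampling pattern and window sizes may differ.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]  # integer (x, y, z) index of a non-empty voxel


def build_voxel_hash(coords: List[Coord]) -> Dict[Coord, int]:
    """Map each non-empty voxel coordinate to its row in a feature array."""
    return {c: i for i, c in enumerate(coords)}


def gather_window(table: Dict[Coord, int], center: Coord, radius: int) -> List[int]:
    """Return feature rows of non-empty voxels inside a cubic window.

    Empty voxels are skipped via hash lookups, so no dense grid is built.
    """
    cx, cy, cz = center
    hits = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                idx = table.get((cx + dx, cy + dy, cz + dz))
                if idx is not None:
                    hits.append(idx)
    return hits


def mixed_scale_keys(table: Dict[Coord, int], center: Coord,
                     radii: List[int]) -> Dict[int, List[int]]:
    """One key set per head group; each group uses its own window radius (assumed)."""
    return {r: gather_window(table, center, r) for r in radii}


def chessboard_subset(coords: List[Coord], phase: int, period: int = 2) -> List[Coord]:
    """Keep voxels on one colour of a checkerboard in the x-y plane (assumed pattern)."""
    return [c for c in coords if (c[0] + c[1]) % period == phase]


coords = [(0, 0, 0), (1, 0, 0), (3, 3, 0), (0, 1, 0)]
table = build_voxel_hash(coords)
keys_per_group = mixed_scale_keys(table, (0, 0, 0), [1, 3])
queries = chessboard_subset(coords, phase=0)
```

In this toy scene the small-radius group sees only the three voxels next to the origin, while the large-radius group also reaches the distant voxel at `(3, 3, 0)`. Sampling queries on one checkerboard phase halves the number of attention windows to evaluate in this example.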


Pages: 14

