Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

Cited by: 143
Authors
He, Chenhang [1 ]
Li, Ruihuang [1 ]
Li, Shuai [1 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Hong Kong, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
DOI
10.1109/CVPR52688.2022.00823
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Transformer has demonstrated promising performance in many 2D vision tasks. However, computing self-attention on large-scale point cloud data is cumbersome because a point cloud is a long sequence that is unevenly distributed in 3D space. To address this issue, existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former results in stochastic point dropout, while the latter typically has a narrow attention field. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size over a wide range, and process them in parallel with linear complexity. The proposed VoxSeT integrates the high performance of transformers with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones. VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks. The source code can be found at https://github.com/skyhehe123/VoxSet.
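The VSA mechanism the abstract describes (replacing the quadratic self-attention inside each voxel with two cross-attentions through a small set of latent codes) resembles induced set attention. A minimal NumPy sketch of that idea, under the assumption of a single head with no learned projections or normalization (the function name, shapes, and latent count are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def induced_set_attention(points, latents):
    """Approximate N-by-N self-attention with two cross-attentions
    through k latent codes: O(N*k) instead of O(N^2).

    points:  (N, d) features of one voxel's points (N may vary per voxel)
    latents: (k, d) latent codes (learned in practice; random here)
    """
    d = points.shape[1]
    # Step 1: latents attend to the points, inducing a hidden set of size k.
    h = softmax(latents @ points.T / np.sqrt(d)) @ points   # (k, d)
    # Step 2: points attend back to the induced hidden set.
    return softmax(points @ h.T / np.sqrt(d)) @ h           # (N, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))   # k = 8 latent codes, d = 16
for n in (5, 300):                   # voxels with very different point counts
    pts = rng.normal(size=(n, 16))
    print(induced_set_attention(pts, latents).shape)
```

Because every point interacts with the fixed latent set rather than with every other point, voxels containing wildly different numbers of points cost only O(N*k) each, which is what lets the module handle arbitrary-size voxel clusters in parallel with linear complexity.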
Pages: 8407-8417 (11 pages)