Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

Cited by: 143
Authors
He, Chenhang [1 ]
Li, Ruihuang [1 ]
Li, Shuai [1 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Hong Kong, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
DOI
10.1109/CVPR52688.2022.00823
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Transformer has demonstrated promising performance in many 2D vision tasks. However, computing self-attention on large-scale point cloud data is cumbersome because a point cloud is a long sequence that is unevenly distributed in 3D space. To address this issue, existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former results in stochastic point dropout, while the latter typically has a narrow attention field. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size over a wide range, and process them in parallel with linear complexity. The proposed VoxSeT integrates the high performance of transformers with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones. VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks. The source code can be found at https://github.com/skyhehe123/VoxSet.
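The VSA mechanism the abstract describes (replacing the quadratic self-attention inside each voxel with two cross-attentions through a small set of latent codes) resembles induced set attention. A minimal NumPy sketch of that idea, under the assumption of a single head with no learned projections or normalization (the function name, shapes, and latent count are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def induced_set_attention(points, latents):
    """Approximate N-by-N self-attention with two cross-attentions
    through k latent codes: O(N*k) instead of O(N^2).

    points:  (N, d) features of one voxel's points (N may vary per voxel)
    latents: (k, d) latent codes (learned in practice; random here)
    """
    d = points.shape[1]
    # Step 1: latents attend to the points, inducing a hidden set of size k.
    h = softmax(latents @ points.T / np.sqrt(d)) @ points   # (k, d)
    # Step 2: points attend back to the induced hidden set.
    return softmax(points @ h.T / np.sqrt(d)) @ h           # (N, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))   # k = 8 latent codes, d = 16
for n in (5, 300):                   # voxels with very different point counts
    pts = rng.normal(size=(n, 16))
    print(induced_set_attention(pts, latents).shape)
```

Because every point interacts with the fixed latent set rather than with every other point, voxels containing wildly different numbers of points cost only O(N*k) each, which is what lets the module handle arbitrary-size voxel clusters in parallel with linear complexity.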
Pages: 8407-8417 (11 pages)