HCPVF: Hierarchical Cascaded Point-Voxel Fusion for 3D Object Detection

Cited by: 6
Authors
Fan, Baojie [1, 2, 3]
Zhang, Kexin [1, 2, 3]
Tian, Jiandong [4]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Automat, Nanjing 210023, Peoples R China
[2] Nanjing Univ Posts & Telecommun, Coll Artificial Intelligence, Nanjing 210023, Peoples R China
[3] Wuzhou Univ, Guangxi Key Lab Machine Vis & Intelligent Control, Wuzhou 543002, Peoples R China
[4] Chinese Acad Sci, Shenyang Inst Automat, State Key Lab Robot, Beijing 100045, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Three-dimensional displays; Feature extraction; Point cloud compression; Proposals; Object detection; Detectors; Transformers; 3D object detection; BEV; voxel; point cloud;
DOI
10.1109/TCSVT.2023.3268849
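Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract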
With the rapid development of 3D sensors, point-cloud-based 3D object detection is attracting increasing attention from both industry and academia and is widely applied in fields such as robotics and autonomous driving. However, balancing detection accuracy and speed remains a challenging problem. In this paper, we study this issue and propose a novel and effective 3D point cloud object detection network based on hierarchical cascaded point-voxel fusion, called HCPVF. First, a novel bird's-eye-view (BEV) attention mechanism with linear complexity is developed to improve the point cloud feature backbone network. It mines point-to-point similarity in the BEV and is easily implemented with two cascaded linear layers and two normalization layers. This operation captures long-range dependencies and reduces the uneven sampling of sparse BEV features, making the extracted point cloud features more discriminative. Second, the proposed HCPVF module is equipped with a dual-level hierarchical cascaded detection head, consisting of a voxel level followed by a point level. The voxel level is composed of coarse region-of-interest (RoI) pooling and fine RoI pooling, which cooperate to aggregate voxel features from different grid divisions and predict relatively coarse detection boxes. The subsequent point level is based on a Key Points Transformer. It first encodes the spatial context information between the original points and the voxel-level boxes. Then, a novel dual-weighted decoder enhances the context interaction by weighting the channel and spatial dimensions to obtain more accurate detection results. This design combines the high computational efficiency of voxel-based methods with the more complete spatial information of point-based methods, fusing low-level voxel features and high-level point features through a hierarchical cascaded strategy.
Extensive experiments demonstrate that the proposed HCPVF achieves state-of-the-art 3D detection performance while maintaining computational efficiency on both the Waymo Open Dataset and the highly competitive KITTI benchmark.
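The abstract describes the BEV attention only as two cascaded linear layers and two normalization layers with linear complexity; it does not give the exact formulation. The NumPy sketch below is one possible reading, assuming an external-attention-style design in which BEV cells attend to a small learned memory rather than to each other. The function name `bev_linear_attention`, the memory size `M`, the weight names `w_key` and `w_value`, and the choice of softmax followed by L1 normalization are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def bev_linear_attention(features, w_key, w_value, eps=1e-9):
    """Hypothetical linear-complexity attention over flattened BEV cells.

    features: (N, C) array, one row per BEV cell.
    w_key:    (C, M) weights of the first linear layer (assumed memory of M slots).
    w_value:  (M, C) weights of the second linear layer.
    Returns (output, attention) with shapes (N, C) and (N, M).
    """
    # First linear layer: score each BEV cell against M memory slots.
    attn = features @ w_key                          # (N, M)
    # First normalization (assumed): softmax over the N cells, per slot.
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Second normalization (assumed): L1 over the M slots, per cell.
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)
    # Second linear layer: project the attention map back to C channels.
    output = attn @ w_value                          # (N, C)
    return output, attn


# Toy usage with random data: 200 BEV cells, 16 channels, 8 memory slots.
rng = np.random.default_rng(0)
cells = rng.normal(size=(200, 16))
out, attn = bev_linear_attention(
    cells, rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
)
```

Under these assumptions the cost is O(N * M * C), which grows linearly with the number of BEV cells N, in contrast to the O(N^2 * C) cost of full pairwise self-attention.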
Pages: 8997-9009
Page count: 13