DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

Cited by: 1
Authors
Wang, Junyin [1 ]
Du, Chenghu [1 ]
Li, Hui [2 ]
Xiong, Shengwu [3 ]
Affiliations
[1] Wuhan Univ Technol, Wuhan, Peoples R China
[2] Qingdao Univ Sci & Technol, Qingdao, Peoples R China
[3] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Source
Proceedings of the 31st ACM International Conference on Multimedia (MM 2023) | 2023
Keywords
point cloud; multimodal fusion; depth estimation; 3D object detection
DOI
10.1145/3581783.3612344
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Approaches that lift surround-view camera images into a 3D feature space via depth estimation and fuse them with point-cloud features have attracted considerable attention. However, the transformation of 2D features into 3D space through predefined sampling points and a depth distribution is applied across the entire scene, which generates a large number of redundant features. In addition, multimodal fusion in the unified 3D space is usually performed only in the step immediately preceding the downstream task, ignoring interactive fusion across different scales. To this end, we design a new framework that endows images with 3D geometric information, unifies both modalities in voxel space for multi-scale interactive fusion, and mitigates cross-modal feature misalignment through the geometric relationships between voxel features. The method has two main designs. First, a Segmentation-guided Image View Transformation module accurately lifts the pixel regions containing objects into a 3D pseudo-point voxel space with the help of a depth distribution, so that subsequent fusion operates on a unified voxel representation. Second, a Voxel-centric Consistent Fusion module alleviates the errors caused by depth estimation and achieves better fusion between the unified modalities. Extensive experiments on the KITTI and nuScenes datasets validate the effectiveness of our camera-LiDAR fusion method: it shows competitive performance on both datasets and outperforms state-of-the-art methods in certain classes of the 3D object detection benchmarks. [code release]
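
The paper's implementation is not included in this record. As a rough illustration of the abstract's first idea, the following minimal PyTorch sketch shows a segmentation-guided, LSS-style view transformation: each pixel's feature is spread along a predicted depth distribution, but a segmentation head gates the lift so only object regions produce pseudo-points. The module name SegGuidedLift, all tensor shapes, and the gating strategy are illustrative assumptions, not the authors' code.

# Hypothetical sketch (not the authors' code) of a segmentation-guided,
# LSS-style lift: pixel features are spread along predicted depth bins,
# and a foreground mask suppresses redundant background pseudo-points.
import torch
import torch.nn as nn

class SegGuidedLift(nn.Module):
    def __init__(self, in_ch: int, depth_bins: int, feat_ch: int):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, depth_bins, 1)  # per-pixel depth distribution
        self.seg_head = nn.Conv2d(in_ch, 1, 1)             # per-pixel foreground probability
        self.feat_head = nn.Conv2d(in_ch, feat_ch, 1)      # features to be lifted

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) image backbone features
        depth = self.depth_head(img_feats).softmax(dim=1)  # (B, D, H, W)
        fg = self.seg_head(img_feats).sigmoid()            # (B, 1, H, W)
        feats = self.feat_head(img_feats) * fg             # background pixels contribute ~0
        # Outer product spreads each pixel feature along its depth bins,
        # yielding a frustum of pseudo-points: (B, D, F, H, W), to be
        # scattered into the voxel grid downstream.
        return depth.unsqueeze(2) * feats.unsqueeze(1)

lift = SegGuidedLift(in_ch=256, depth_bins=64, feat_ch=80)
frustum = lift(torch.randn(2, 256, 32, 88))
print(frustum.shape)  # torch.Size([2, 64, 80, 32, 88])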
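
Similarly, the Voxel-centric Consistent Fusion module can be read as per-voxel gated fusion between LiDAR voxels and camera pseudo-point voxels, with the gate learning to down-weight camera features where the estimated depth is unreliable. The sketch below is an assumption-based illustration of that reading; the gate design, the name VoxelConsistentFusion, and the shapes are hypothetical.

# Hypothetical sketch (not the authors' code) of per-voxel gated fusion:
# a learned gate, computed from both modalities, softens the impact of
# depth-estimation errors in the camera branch.
import torch
import torch.nn as nn

class VoxelConsistentFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        # Per-voxel gate conditioned on both modalities.
        self.gate = nn.Sequential(nn.Conv3d(2 * ch, ch, kernel_size=1), nn.Sigmoid())

    def forward(self, lidar_vox: torch.Tensor, cam_vox: torch.Tensor) -> torch.Tensor:
        # lidar_vox, cam_vox: (B, C, Z, Y, X) dense voxel features
        g = self.gate(torch.cat([lidar_vox, cam_vox], dim=1))
        return lidar_vox + g * cam_vox  # LiDAR kept intact; camera gated in

fuse = VoxelConsistentFusion(ch=64)
out = fuse(torch.randn(1, 64, 8, 100, 100), torch.randn(1, 64, 8, 100, 100))
print(out.shape)  # torch.Size([1, 64, 8, 100, 100])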
Pages: 3765-3776
Page count: 12