DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection

Cited by: 1
Authors
Wang, Junyin [1 ]
Du, Chenghu [1 ]
Li, Hui [2 ]
Xiong, Shengwu [3 ]
Affiliations
[1] Wuhan University of Technology, Wuhan, China
[2] Qingdao University of Science and Technology, Qingdao, China
[3] Shanghai Artificial Intelligence Laboratory, Shanghai, China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
point cloud; multimodal fusion; depth estimation; 3D object detection;
DOI
10.1145/3581783.3612344
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Methods that lift surround-view camera images into a 3D feature space via estimated depth and fuse them with point cloud features have attracted considerable attention. However, transforming 2D features into 3D space with predefined sampling points and a depth distribution is applied over the entire scene, which produces a large number of redundant features. In addition, multimodal fusion in the unified 3D space is typically performed only in the step immediately preceding the downstream task, ignoring interactive fusion across scales. To this end, we design a new framework that endows images with 3D geometric information, unifies both modalities in voxel space for multi-scale interactive fusion, and mitigates cross-modal misalignment through the geometric relationships between voxel features. The method has two main designs. First, a Segmentation-guided Image View Transformation module accurately transforms only the pixel regions containing objects into a 3D pseudo-point voxel space with the help of a depth distribution, so that subsequent fusion operates on unified voxel features. Second, a Voxel-centric Consistent Fusion module alleviates the errors introduced by depth estimation and improves feature fusion between the unified modalities. Extensive experiments on the KITTI and nuScenes datasets validate the effectiveness of our camera-LiDAR fusion method: it achieves competitive performance on both datasets and outperforms state-of-the-art methods on certain classes of the 3D object detection benchmarks.
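The abstract names the two modules but gives no implementation detail. The following is a minimal PyTorch sketch of how a segmentation-guided depth lift and a voxel-level gated fusion could be wired together; the class names SegGuidedViewTransform and VoxelConsistentFusion, all tensor shapes, and the sigmoid gating are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

class SegGuidedViewTransform(nn.Module):
    # Lift image features into a voxel grid, keeping only pixels that a
    # segmentation head marks as foreground (a guess at the Segmentation-guided
    # Image View Transformation module described in the abstract).
    def __init__(self, in_ch, num_depth_bins, voxel_ch):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, num_depth_bins, 1)  # per-pixel depth distribution
        self.seg_head = nn.Conv2d(in_ch, 1, 1)                 # foreground probability
        self.feat_proj = nn.Conv2d(in_ch, voxel_ch, 1)

    def forward(self, img_feat):
        # img_feat: (B, C, H, W)
        depth_dist = self.depth_head(img_feat).softmax(dim=1)  # (B, D, H, W)
        fg_mask = torch.sigmoid(self.seg_head(img_feat))       # (B, 1, H, W)
        feat = self.feat_proj(img_feat) * fg_mask              # suppress background pixels
        # Outer product over the depth axis: (B, V, D, H, W) pseudo-point voxel features.
        return feat.unsqueeze(2) * depth_dist.unsqueeze(1)

class VoxelConsistentFusion(nn.Module):
    # Blend camera and LiDAR voxel features with a learned per-voxel gate,
    # a rough stand-in for the Voxel-centric Consistent Fusion module.
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Conv3d(2 * ch, ch, 1)

    def forward(self, cam_vox, lidar_vox):
        # Both inputs share one voxel grid: (B, C, D, H, W).
        w = torch.sigmoid(self.gate(torch.cat([cam_vox, lidar_vox], dim=1)))
        # Down-weight camera voxels where the gate favors the LiDAR branch,
        # one plausible way to absorb depth-estimation error.
        return w * cam_vox + (1.0 - w) * lidar_vox

if __name__ == "__main__":
    B, C, H, W, D, V = 1, 64, 32, 32, 16, 32
    cam_vox = SegGuidedViewTransform(C, D, V)(torch.randn(B, C, H, W))
    fused = VoxelConsistentFusion(V)(cam_vox, torch.randn(B, V, D, H, W))
    print(fused.shape)  # torch.Size([1, 32, 16, 32, 32])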
Pages: 3765 - 3776
Page count: 12