Channelwise and Spatially Guided Multimodal Feature Fusion Network for 3-D Object Detection in Autonomous Vehicles

Cited: 6
Authors
Uzair, Muhammad [1 ]
Dong, Jian [2 ]
Shi, Ronghua [2 ]
Mushtaq, Husnain [1 ]
Ullah, Irshad [1 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China
[2] Cent South Univ, Sch Elect Informat, Changsha 410083, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2024, Vol. 62
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Three-dimensional displays; Laser radar; Point cloud compression; Object detection; Semantics; Cameras; Object recognition; Convolutional neural networks; Accuracy; 3-D object detection; class-based point sampling; multimodal fusion; self-attention; semantic feature learning;
DOI
10.1109/TGRS.2024.3476072
CLC Classification Number
P3 [Geophysics]; P59 [Geochemistry];
Discipline Classification Codes
0708 ; 070902 ;
Abstract
Accurate 3-D object detection is vital in autonomous driving. Traditional LiDAR-only models struggle with sparse point clouds. We propose a novel approach that integrates LiDAR and camera data, exploiting the strengths of each sensor while overcoming their individual limitations to enhance 3-D object detection. Our research introduces the channelwise and spatially guided multimodal feature fusion network (CSMNET) for 3-D object detection. First, our method enhances LiDAR data by projecting it onto a 2-D plane, enabling the extraction of class-specific features from a probability map. Second, we design class-based farthest point sampling (C-FPS), which boosts the selection of foreground points by weighting points according to geometric or probability features while ensuring diversity among the selected points. Third, we develop a parallel attention (PAT)-based multimodal fusion mechanism that achieves higher resolution than raw LiDAR points. This fusion mechanism integrates two attention mechanisms: channel attention for LiDAR data and spatial attention for camera data. These mechanisms enhance the utilization of semantic features within a region of interest (ROI) to obtain more representative point features, leading to a more effective fusion of information from the LiDAR and camera sources. On the KITTI dataset, CSMNET achieves an average precision (AP) in bird's eye view (BEV) detection of 90.16% (easy), 85.18% (moderate), and 80.51% (hard), with a mean AP (mAP) of 85.12%. In 3-D detection, CSMNET attains 82.05% (easy), 72.64% (moderate), and 67.10% (hard), with an mAP of 73.75%. For 2-D detection, the scores are 95.47% (easy), 93.25% (moderate), and 86.68% (hard), yielding an mAP of 91.72%.
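The abstract describes C-FPS as farthest point sampling biased toward foreground points via per-point weights while preserving spatial diversity. The record contains no pseudocode, so the following is a minimal illustrative sketch of that idea: greedy farthest-point selection where each candidate's distance to the current sample set is scaled by a foreground weight (assumed here to come from the 2-D probability map). The function name and the multiplicative weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def weighted_fps(points, weights, k):
    """Illustrative class-weighted farthest point sampling.

    points  : (N, 3) array of LiDAR point coordinates.
    weights : (N,) per-point foreground scores (e.g., from a probability map).
    k       : number of points to select.

    Greedily picks the point maximizing (distance to nearest selected
    point) * weight, so likely-foreground points are preferred while the
    distance term still enforces spatial diversity among selections.
    """
    # Seed with the highest-weight (most likely foreground) point.
    selected = [int(np.argmax(weights))]
    min_dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        # Score = distance to the current sample set, scaled by weight;
        # already-selected points have distance 0 and are never re-picked.
        nxt = int(np.argmax(min_dist * weights))
        selected.append(nxt)
        min_dist = np.minimum(
            min_dist, np.linalg.norm(points - points[nxt], axis=1)
        )
    return np.array(selected)

# Example: sample 16 of 100 random points with random foreground scores.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
w = rng.uniform(0.1, 1.0, size=100)
idx = weighted_fps(pts, w, 16)
```

Plain FPS is the special case where all weights are equal; raising the weight of foreground classes skews the sample toward objects of interest, which is the stated motivation for C-FPS.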
Pages: 15