EPMF: Efficient Perception-Aware Multi-Sensor Fusion for 3D Semantic Segmentation

Cited by: 8
Authors
Tan, Mingkui [1 ,2 ]
Zhuang, Zhuangwei [1 ,2 ]
Chen, Sitao [1 ]
Li, Rong [1 ]
Jia, Kui [3 ]
Wang, Qicheng [4 ,5 ]
Li, Yuanqing [2 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510641, Guangdong, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510641, Guangdong, Peoples R China
[4] Hong Kong Univ Sci & Technol, Dept Math, Clear Water Bay, Hong Kong, Peoples R China
[5] Minieye, Shenzhen 518063, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Point cloud compression; Laser radar; Cameras; Semantic segmentation; Three-dimensional displays; Feature extraction; Sensors; 3D semantic segmentation; autonomous driving; deep neural networks; multi-sensor fusion; scene understanding;
DOI
10.1109/TPAMI.2024.3402232
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study multi-sensor fusion for 3D semantic segmentation, which is important to scene understanding in many applications, such as autonomous driving and robotics. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between the two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to effectively exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we project point clouds into the camera coordinate system using perspective projection, and process both the LiDAR and camera inputs in 2D space while preventing information loss from the RGB images. We then propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules. Moreover, we introduce additional perception-aware losses to measure the perceptual difference between the two modalities. Finally, we propose an improved version of PMF, i.e., EPMF, which is more efficient and effective, obtained by optimizing data pre-processing and the network architecture under perspective projection. Specifically, we propose cross-modal alignment and cropping to obtain tight inputs and reduce unnecessary computational costs. We then explore more efficient contextual modules under perspective projection and fuse the LiDAR features into the camera stream to boost the performance of the two-stream network. Extensive experiments on benchmark datasets show the superiority of our method. For example, on the nuScenes test set, our EPMF outperforms the state-of-the-art method, i.e., RangeFormer, by 0.9% in mIoU.
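The perspective-projection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a standard pinhole camera model with a known intrinsic matrix K and a rigid LiDAR-to-camera extrinsic transform, and all names are hypothetical.

```python
import numpy as np

def project_points_to_image(points, K, T_cam_from_lidar, img_w, img_h):
    """Project LiDAR points (N, 3) into the camera image plane.

    points: (N, 3) xyz coordinates in the LiDAR frame.
    K: (3, 3) camera intrinsic matrix.
    T_cam_from_lidar: (4, 4) rigid transform from LiDAR to camera frame.
    Returns pixel coords (M, 2), depths (M,), and a boolean mask over
    the original N points marking which ones landed inside the image.
    """
    # Homogeneous coordinates, then move points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera (positive depth).
    front = cam[:, 2] > 1e-6
    cam = cam[front]

    # Perspective projection: pixel = K @ (x, y, z), then divide by depth.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Discard projections that fall outside the image bounds.
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
    mask = np.zeros(points.shape[0], dtype=bool)
    mask[np.flatnonzero(front)[inside]] = True
    return uv[inside], cam[inside, 2], mask
```

The resulting pixel coordinates let per-point LiDAR features be placed on the image grid, so that both modalities can be processed by 2D networks in the same view, as the abstract describes.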
Pages: 8258-8273
Page count: 16