UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Cited by: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tianyijiaotong Technol Ltd, Suzhou, Peoples R China
[3] ASTAR, Inst High Performance Comp IHPC, Singapore, Singapore
Source
COMPUTER VISION - ECCV 2024, PT XXII | 2025 / Vol. 15080
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is common to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can yield rich and powerful representations, MAE methods struggle with this integration because of the substantial disparity between the modalities. This work investigates multi-modal Masked Autoencoders tailored to a unified representation space for autonomous driving, aiming at a more effective fusion of the two distinct modalities. To marry the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM²AE, a potent yet straightforward multi-modal self-supervised pre-training framework consisting of two main designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's-eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, a Multi-modal 3D Interactive Module (MMIM) facilitates efficient inter-modal interaction. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM²AE, showing improvements of 1.2% NDS in 3D object detection and 6.5% mIoU in BEV map segmentation. The code is available at https://github.com/hollow-503/UniM2AE.
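The abstract describes projecting camera and LiDAR features into a shared 3D volume (BEV extended with height) before cross-modal interaction. The sketch below is a minimal PyTorch illustration of that idea only; it is not the authors' implementation, and the class name Simple3DInteraction, the tensor shapes, and the plain-convolution fusion are illustrative assumptions standing in for the paper's MMIM.

```python
# Hypothetical sketch (not the UniM2AE code): fuse camera and LiDAR features that
# have already been lifted into a shared 3D volume, then mix them with a small
# 3D-convolutional block as a stand-in for the Multi-modal 3D Interactive Module.
import torch
import torch.nn as nn


class Simple3DInteraction(nn.Module):
    """Toy cross-modal fusion: concatenate the two volumes along channels
    and mix them with two 3D convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, cam_vol: torch.Tensor, lidar_vol: torch.Tensor) -> torch.Tensor:
        # Both volumes share the same layout: (batch, channels, Z height, Y, X BEV grid).
        return self.fuse(torch.cat([cam_vol, lidar_vol], dim=1))


if __name__ == "__main__":
    # Dummy per-modality features assumed to be pre-projected into the unified volume.
    B, C, Z, Y, X = 1, 64, 8, 32, 32
    cam_vol = torch.randn(B, C, Z, Y, X)    # image features lifted into 3D space
    lidar_vol = torch.randn(B, C, Z, Y, X)  # voxelized LiDAR features
    fused = Simple3DInteraction(C)(cam_vol, lidar_vol)
    print(fused.shape)  # torch.Size([1, 64, 8, 32, 32])
```

In the paper's framework this interaction is applied during masked-autoencoder pre-training on the unified volume; the convolutional fusion above is only a simplified placeholder for that module.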
Pages: 296 - 313
Page count: 18