UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Cited by: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tianyijiaotong Technol Ltd, Suzhou, Peoples R China
[3] ASTAR, Inst High Performance Comp IHPC, Singapore, Singapore
Source
COMPUTER VISION - ECCV 2024, PT XXII | 2025 / Vol. 15080
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is common to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can yield rich and powerful representations, MAE methods struggle with this integration because of the substantial disparity between the modalities. This work investigates multi-modal Masked Autoencoders tailored to a unified representation space for autonomous driving, aiming at a more effective fusion of the two distinct modalities. To marry the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM²AE, a potent yet straightforward multi-modal self-supervised pre-training framework consisting of two main designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's-eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, a Multi-modal 3D Interactive Module (MMIM) facilitates efficient inter-modal interaction. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM²AE, showing improvements of 1.2% NDS in 3D object detection and 6.5% mIoU in BEV map segmentation. The code is available at https://github.com/hollow-503/UniM2AE.
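The abstract describes projecting camera and LiDAR features into a shared 3D volume (BEV extended with height) before cross-modal interaction. The sketch below is a minimal PyTorch illustration of that idea only; it is not the authors' implementation, and the class name Simple3DInteraction, the tensor shapes, and the plain-convolution fusion are illustrative assumptions standing in for the paper's MMIM.

```python
# Hypothetical sketch (not the UniM2AE code): fuse camera and LiDAR features that
# have already been lifted into a shared 3D volume, then mix them with a small
# 3D-convolutional block as a stand-in for the Multi-modal 3D Interactive Module.
import torch
import torch.nn as nn


class Simple3DInteraction(nn.Module):
    """Toy cross-modal fusion: concatenate the two volumes along channels
    and mix them with two 3D convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, cam_vol: torch.Tensor, lidar_vol: torch.Tensor) -> torch.Tensor:
        # Both volumes share the same layout: (batch, channels, Z height, Y, X BEV grid).
        return self.fuse(torch.cat([cam_vol, lidar_vol], dim=1))


if __name__ == "__main__":
    # Dummy per-modality features assumed to be pre-projected into the unified volume.
    B, C, Z, Y, X = 1, 64, 8, 32, 32
    cam_vol = torch.randn(B, C, Z, Y, X)    # image features lifted into 3D space
    lidar_vol = torch.randn(B, C, Z, Y, X)  # voxelized LiDAR features
    fused = Simple3DInteraction(C)(cam_vol, lidar_vol)
    print(fused.shape)  # torch.Size([1, 64, 8, 32, 32])
```

In the paper's framework this interaction is applied during masked-autoencoder pre-training on the unified volume; the convolutional fusion above is only a simplified placeholder for that module.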
Pages: 296 - 313
Page count: 18