WildRefer: 3D Object Localization in Large-Scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Cited by: 0
Authors
Lin, Zhenxiang [1]
Peng, Xidong [1]
Cong, Peishan [1]
Zhang, Ge [1]
Sung, Yujin [2]
Hou, Yuenan [3]
Zhu, Xinge [4]
Yang, Sibei [1]
Ma, Yuexin [1]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Univ Hong Kong, Hong Kong, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
[4] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Source
COMPUTER VISION-ECCV 2024, PT XLVI | 2025 / Vol. 15104
Funding
Natural Science Foundation of Shanghai;
Keywords
DOI
10.1007/978-3-031-72952-2_26
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural language descriptions and multi-modal visual data captured online, including 2D images and 3D LiDAR point clouds. We present a novel method for this task, dubbed WildRefer, which fully exploits the rich appearance information in images, the position and geometric cues in point clouds, and the semantic knowledge in language descriptions. In addition, we propose two novel datasets, STRefer and LifeRefer, which focus on large-scale human-centric daily-life scenarios and come with abundant 3D object and natural language annotations. Our datasets are significant for research on 3D visual grounding in the wild and have great potential to advance the development of autonomous driving and service robots. Extensive experiments and ablation studies demonstrate that our method achieves state-of-the-art performance on the proposed benchmarks. The code is available at https://github.com/4DVLab/WildRefer.
Pages: 456-473
Page count: 18