3D Visual Grounding-Audio: 3D scene object detection based on audio

Citations: 0
Authors
Zhang, Can
Cai, Zeyu
Chen, Xunhao
Da, Feipeng
Gai, Shaoyan [1]
Affiliations
[1] Southeast Univ, Sch Automat, Sipailou 2, Nanjing 210096, Jiangsu, Peoples R China
Keywords
Point cloud; Audio; 3D visual grounding; Multi-modal; AUDIOVISUAL SEGMENTATION; FUSION;
DOI
10.1016/j.neucom.2024.128637
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
3D Visual Grounding (3DVG) is a widely studied multi-modal information fusion task that localizes target objects referenced in natural language descriptions within a point cloud scene. Nevertheless, its stringent demands on input and output devices present substantial hurdles for applying and integrating 3DVG in fields such as remote robotic control and telemedicine. To address this challenge, we make three contributions. First, we propose a novel multi-modal task, termed 3D Visual Grounding-Audio (3DVG-Audio), based on the fusion of audio and point cloud data. To the best of our knowledge, this is the first Audio-Point Cloud multi-modal task. 3DVG-Audio localizes the objects mentioned in an audio input within the corresponding point cloud. Second, building upon ScanRefer, we have developed a novel dataset named 3DVG-AudioSet, specifically designed for training and evaluating 3DVG-Audio methods. Finally, we have designed a tailored loss function to further improve performance on 3DVG-Audio and introduced a method named AP-Refer, which serves as a benchmark for the task. Extensive experimental results demonstrate the potential of deeply integrating audio and point clouds to tackle complex real-world challenges. AP-Refer successfully addresses the 3DVG-Audio task, circumvents the limitations of conventional 3DVG methods, and exhibits significant application potential.
Pages: 10
Related papers
58 in total
[1]   ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes [J].
Achlioptas, Panos ;
Abdelreheem, Ahmed ;
Xia, Fei ;
Elhoseiny, Mohamed ;
Guibas, Leonidas .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :422-440
[2]  
[Anonymous], 2000, ORGAN SOUND, DOI 10.1017/S1355771800003071
[3]   ScanQA: 3D Question Answering for Spatial Scene Understanding [J].
Azuma, Daichi ;
Miyanishi, Taiki ;
Kurita, Shuhei ;
Kawanabe, Motoaki .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :19107-19117
[4]   SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences [J].
Behley, Jens ;
Garbade, Martin ;
Milioto, Andres ;
Quenzel, Jan ;
Behnke, Sven ;
Stachniss, Cyrill ;
Gall, Juergen .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :9296-9306
[5]   3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds [J].
Cai, Daigang ;
Zhao, Lichen ;
Zhang, Jing ;
Sheng, Lu ;
Xu, Dong .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :16443-16452
[6]   ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language [J].
Chen, Dave Zhenyu ;
Chang, Angel X. ;
Niessner, Matthias .
COMPUTER VISION - ECCV 2020, PT XX, 2020, 12365 :202-221
[7]   FocalFormer3D: Focusing on Hard Instance for 3D Object Detection [J].
Chen, Yilun ;
Yu, Zhiding ;
Chen, Yukang ;
Lan, Shiyi ;
Anandkumar, Anima ;
Jia, Jiaya ;
Alvarez, Jose M. .
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, :8360-8371
[8]   Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [J].
Chen, Ziyang ;
Gebru, Israel D. ;
Richardt, Christian ;
Kumar, Anurag ;
Laney, William ;
Owens, Andrew ;
Richard, Alexander .
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, :21886-21896
[9]  
Cheng ZY, 2024, arXiv, DOI arXiv:2304.14614
[10]  
Collobert R., 2019, arXiv, DOI 10.48550/arXiv.1904.05862