ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language

Cited by: 131
Authors
Chen, Dave Zhenyu [1]
Chang, Angel X. [2]
Niessner, Matthias [1]
Affiliations
[1] Tech Univ Munich, Munich, Germany
[2] Simon Fraser Univ, Burnaby, BC, Canada
Source
COMPUTER VISION - ECCV 2020, PT XX | 2020 / Vol. 12365
Keywords
DOI
10.1007/978-3-030-58565-5_13
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet [8] scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D (Code: https://daveredrum.github.io/ScanRefer/).
Pages: 202-221
Page count: 20
Related papers
73 in total
[41]  
Narita G, 2019, Arxiv, DOI arXiv:1903.01177
[42]  
Nguyen A, 2018, Arxiv, DOI arXiv:1803.06152
[43]  
Paszke A, 2016, Arxiv, DOI 10.48550/arXiv.1606.02147
[44]  
Pennington J, 2014, P 2014 C EMP METH NA, DOI 10.3115/v1/D14-1162
[45]   Conditional Image-Text Embedding Networks [J].
Plummer, Bryan A. ;
Kordas, Paige ;
Kiapour, M. Hadi ;
Zheng, Shuai ;
Piramuthu, Robinson ;
Lazebnik, Svetlana .
COMPUTER VISION - ECCV 2018, PT XII, 2018, 11216 :258-274
[46]   Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models [J].
Plummer, Bryan A. ;
Wang, Liwei ;
Cervantes, Chris M. ;
Caicedo, Juan C. ;
Hockenmaier, Julia ;
Lazebnik, Svetlana .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2641-2649
[47]  
Prabhudesai M, 2021, Arxiv, DOI arXiv:1910.01210
[48]   Deep Hough Voting for 3D Object Detection in Point Clouds [J].
Qi, Charles R. ;
Litany, Or ;
He, Kaiming ;
Guibas, Leonidas J. .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :9276-9285
[49]  
Qi Charles Ruizhongtai, 2017, PROC 31 INT C NEURAL
[50]   REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments [J].
Qi, Yuankai ;
Wu, Qi ;
Anderson, Peter ;
Wang, Xin ;
Wang, William Yang ;
Shen, Chunhua ;
van den Hengel, Anton .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :9979-9988