VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Cited by: 6
Authors
Yokoyama, Naoki [1 ,2 ]
Ha, Sehoon [2 ]
Batra, Dhruv [2 ]
Wang, Jiuguang [1 ]
Bucher, Bernadette [1 ]
Affiliations
[1] Boston Dynamics AI Institute, Boston, MA, USA
[2] Georgia Institute of Technology, Atlanta, GA 30332, USA
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024 | 2024
DOI
10.1109/ICRA57147.2024.10610712
CLC Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real world deployment can be viewed at naoki.io/vlfm.
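The abstract describes VLFM's core loop: identify frontiers from an occupancy map, score each one with a language-grounded value map produced by a vision-language model, and navigate toward the highest-scoring frontier. A minimal sketch of that frontier-ranking step is shown below; the function name, the map representation, and the toy scores are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_best_frontier(frontiers, value_map):
    """Pick the frontier whose cell in the language-grounded value map
    scores highest. Hypothetical sketch of VLFM's frontier-ranking idea,
    not the paper's actual code."""
    scores = [value_map[r, c] for (r, c) in frontiers]
    return frontiers[int(np.argmax(scores))]

# Toy example: a 4x4 value map with three candidate frontier cells.
value_map = np.zeros((4, 4))
value_map[1, 2] = 0.9   # cell judged most relevant to the target object
value_map[3, 0] = 0.4
frontiers = [(0, 0), (1, 2), (3, 0)]
best = select_best_frontier(frontiers, value_map)
```

In the actual system, the per-cell values would come from a pre-trained vision-language model scoring RGB observations against the target object category, rather than from hand-set numbers as here.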
Pages: 42-48
Page count: 7