VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Cited by: 4
Authors
Yokoyama, Naoki [1 ,2 ]
Ha, Sehoon [2 ]
Batra, Dhruv [2 ]
Wang, Jiuguang [1 ]
Bucher, Bernadette [1 ]
Affiliations
[1] Boston Dynamics AI Institute, Boston, MA, USA
[2] Georgia Institute of Technology, Atlanta, GA 30332, USA
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024 | 2024
DOI
10.1109/ICRA57147.2024.10610712
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.
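The abstract outlines VLFM's pipeline: an occupancy map built from depth observations yields frontier cells, RGB frames scored against the target-object text by a pre-trained vision-language model populate a language-grounded value map, and the highest-valued frontier becomes the next exploration target. The Python sketch below is a minimal illustration of that selection loop under assumptions, not the authors' implementation: the abstract does not name the vision-language model, so vlm_image_text_score is a random placeholder, and the blending rule in update_value_map and the helper names (update_value_map, pick_best_frontier) are invented for this sketch; frontier extraction from the occupancy map is taken as given.

import numpy as np

def vlm_image_text_score(rgb_frame: np.ndarray, prompt: str) -> float:
    # Hypothetical stand-in for the pre-trained vision-language model's
    # image-text relevance score; a seeded random value keeps the sketch
    # self-contained. Replace with a real VLM score in practice.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return float(rng.random())

def update_value_map(value_map: np.ndarray, confidence: np.ndarray,
                     visible: np.ndarray, score: float) -> None:
    # Blend the new semantic score into the map cells currently visible.
    # A uniform confidence of 1.0 per observation stands in for whatever
    # weighting the paper uses (not specified in the abstract).
    new_conf = visible.astype(float)
    total = confidence + new_conf
    mask = total > 0
    value_map[mask] = (value_map[mask] * confidence[mask]
                       + score * new_conf[mask]) / total[mask]
    confidence[mask] = np.maximum(confidence[mask], new_conf[mask])

def pick_best_frontier(frontiers, value_map):
    # Return the frontier cell (row, col) with the highest language-grounded value.
    return max(frontiers, key=lambda cell: value_map[cell])

if __name__ == "__main__":
    H, W = 64, 64
    value_map = np.zeros((H, W))
    confidence = np.zeros((H, W))
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)   # current camera frame
    visible = np.zeros((H, W), dtype=bool)
    visible[20:40, 20:40] = True                     # cells inside the camera's view
    score = vlm_image_text_score(rgb, "a photo of a toilet")
    update_value_map(value_map, confidence, visible, score)
    frontiers = [(25, 39), (10, 10), (50, 5)]        # frontier cells from the occupancy map
    print("best frontier:", pick_best_frontier(frontiers, value_map))

For reference, the SPL metric cited in the abstract weights each episode's success by the ratio of the shortest-path length to the path actually taken, so a higher SPL rewards agents that both reach the goal and do so efficiently.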
Pages: 42-48
Page count: 7