VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Cited by: 4
Authors
Yokoyama, Naoki [1 ,2 ]
Ha, Sehoon [2 ]
Batra, Dhruv [2 ]
Wang, Jiuguang [1 ]
Bucher, Bernadette [1 ]
Affiliations
[1] Boston Dynamics AI Institute, Boston, MA, USA
[2] Georgia Institute of Technology, Atlanta, GA 30332, USA
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024 | 2024
DOI
10.1109/ICRA57147.2024.10610712
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.
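The abstract outlines VLFM's pipeline: an occupancy map built from depth observations yields frontier cells, RGB frames scored against the target-object text by a pre-trained vision-language model populate a language-grounded value map, and the highest-valued frontier becomes the next exploration target. The Python sketch below is a minimal illustration of that selection loop under assumptions, not the authors' implementation: the abstract does not name the vision-language model, so vlm_image_text_score is a random placeholder, and the blending rule in update_value_map and the helper names (update_value_map, pick_best_frontier) are invented for this sketch; frontier extraction from the occupancy map is taken as given.

import numpy as np

def vlm_image_text_score(rgb_frame: np.ndarray, prompt: str) -> float:
    # Hypothetical stand-in for the pre-trained vision-language model's
    # image-text relevance score; a seeded random value keeps the sketch
    # self-contained. Replace with a real VLM score in practice.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return float(rng.random())

def update_value_map(value_map: np.ndarray, confidence: np.ndarray,
                     visible: np.ndarray, score: float) -> None:
    # Blend the new semantic score into the map cells currently visible.
    # A uniform confidence of 1.0 per observation stands in for whatever
    # weighting the paper uses (not specified in the abstract).
    new_conf = visible.astype(float)
    total = confidence + new_conf
    mask = total > 0
    value_map[mask] = (value_map[mask] * confidence[mask]
                       + score * new_conf[mask]) / total[mask]
    confidence[mask] = np.maximum(confidence[mask], new_conf[mask])

def pick_best_frontier(frontiers, value_map):
    # Return the frontier cell (row, col) with the highest language-grounded value.
    return max(frontiers, key=lambda cell: value_map[cell])

if __name__ == "__main__":
    H, W = 64, 64
    value_map = np.zeros((H, W))
    confidence = np.zeros((H, W))
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)   # current camera frame
    visible = np.zeros((H, W), dtype=bool)
    visible[20:40, 20:40] = True                     # cells inside the camera's view
    score = vlm_image_text_score(rgb, "a photo of a toilet")
    update_value_map(value_map, confidence, visible, score)
    frontiers = [(25, 39), (10, 10), (50, 5)]        # frontier cells from the occupancy map
    print("best frontier:", pick_best_frontier(frontiers, value_map))

For reference, the SPL metric cited in the abstract weights each episode's success by the ratio of the shortest-path length to the path actually taken, so a higher SPL rewards agents that both reach the goal and do so efficiently.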
Pages: 42-48
Page count: 7