VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Cited by: 6
Authors
Yokoyama, Naoki [1 ,2 ]
Ha, Sehoon [2 ]
Batra, Dhruv [2 ]
Wang, Jiuguang [1 ]
Bucher, Bernadette [1 ]
Affiliations
[1] Boston Dynamics AI Institute, Boston, MA, USA
[2] Georgia Institute of Technology, Atlanta, GA 30332, USA
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024 | 2024
DOI
10.1109/ICRA57147.2024.10610712
CLC Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real world deployment can be viewed at naoki.io/vlfm.
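The abstract describes VLFM's core loop: identify frontiers from an occupancy map, score each one with a language-grounded value map produced by a vision-language model, and navigate toward the highest-scoring frontier. A minimal sketch of that frontier-ranking step is shown below; the function name, the map representation, and the toy scores are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_best_frontier(frontiers, value_map):
    """Pick the frontier whose cell in the language-grounded value map
    scores highest. Hypothetical sketch of VLFM's frontier-ranking idea,
    not the paper's actual code."""
    scores = [value_map[r, c] for (r, c) in frontiers]
    return frontiers[int(np.argmax(scores))]

# Toy example: a 4x4 value map with three candidate frontier cells.
value_map = np.zeros((4, 4))
value_map[1, 2] = 0.9   # cell judged most relevant to the target object
value_map[3, 0] = 0.4
frontiers = [(0, 0), (1, 2), (3, 0)]
best = select_best_frontier(frontiers, value_map)
```

In the actual system, the per-cell values would come from a pre-trained vision-language model scoring RGB observations against the target object category, rather than from hand-set numbers as here.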
Pages: 42-48
Page count: 7