DAP: DOMAIN-AWARE PROMPT LEARNING FOR VISION-AND-LANGUAGE NAVIGATION

Citations: 0
Authors
Liu, Ting [1 ]
Hu, Yue [1 ]
Wu, Wansen [1 ]
Wang, Youkai [1 ]
Xu, Kai [1 ]
Yin, Quanjun [1 ]
Affiliation
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
vision-and-language; multimodal representation
DOI
10.1109/ICASSP48485.2024.10446504
Abstract
Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. With their strong representation capabilities, pretrained vision-and-language models are widely used in vision-and-language navigation (VLN). However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when they are applied to VLN tasks. To address this problem, we propose a novel and model-agnostic Domain-Aware Prompt learning (DAP) framework. To equip pretrained models with object-level and scene-level cross-modal alignment specific to VLN tasks, DAP applies a low-cost prompt-tuning paradigm that learns soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. We then introduce soft visual prompts in the input space of the visual encoder of a pretrained model, injecting in-domain visual knowledge into the visual encoder in an efficient way. Experimental results on both R2R and REVERIE show the superiority of DAP over existing state-of-the-art methods.
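The core mechanism described above is visual prompt tuning: learnable prompt vectors are prepended to the patch-token sequence in the input space of a frozen visual encoder. The sketch below illustrates only that prepending step in plain Python; the function name, dimensions, and random initialization are illustrative assumptions, not the paper's actual implementation (in DAP the prompts would be optimized by gradient descent on the in-domain image-text pairs).

```python
import random

def prepend_visual_prompts(patch_tokens, num_prompts=8, dim=4, seed=0):
    """Prepend soft visual prompt vectors to a patch-embedding sequence.

    patch_tokens: list of token vectors (each a list of floats) produced by
    the visual encoder's patch-embedding layer.
    Returns the prompt-augmented sequence that would be fed to the
    transformer blocks. Prompts here are randomly initialized for
    illustration; in practice they are learnable parameters.
    """
    rng = random.Random(seed)
    prompts = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
               for _ in range(num_prompts)]
    return prompts + patch_tokens

# e.g. a 14x14 ViT patch grid gives 196 patch tokens
patches = [[1.0] * 4 for _ in range(196)]
seq = prepend_visual_prompts(patches)
print(len(seq))  # 8 prompts + 196 patches = 204
```

Only the prompt vectors (and typically a small head) are trained while the encoder weights stay frozen, which is what makes the paradigm low-cost.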
Pages: 2615-2619 (5 pages)