DAP: DOMAIN-AWARE PROMPT LEARNING FOR VISION-AND-LANGUAGE NAVIGATION

Citations: 0
Authors
Liu, Ting [1 ]
Hu, Yue [1 ]
Wu, Wansen [1 ]
Wang, Youkai [1 ]
Xu, Kai [1 ]
Yin, Quanjun [1 ]
Affiliation
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
vision-and-language; multimodal representation
DOI
10.1109/ICASSP48485.2024.10446504
Abstract
Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. With their strong representation capabilities, pretrained vision-and-language models are widely used in vision-and-language navigation (VLN). However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when they are applied to VLN tasks. To address this problem, we propose a novel and model-agnostic Domain-Aware Prompt learning (DAP) framework. To equip pretrained models with object-level and scene-level cross-modal alignment specific to VLN tasks, DAP applies a low-cost prompt-tuning paradigm that learns soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. We then introduce soft visual prompts in the input space of the visual encoder of a pretrained model, injecting in-domain visual knowledge into the visual encoder in an efficient way. Experimental results on both R2R and REVERIE show the superiority of DAP over existing state-of-the-art methods.
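The core mechanism described above is visual prompt tuning: learnable prompt vectors are prepended to the patch-token sequence in the input space of a frozen visual encoder. The sketch below illustrates only that prepending step in plain Python; the function name, dimensions, and random initialization are illustrative assumptions, not the paper's actual implementation (in DAP the prompts would be optimized by gradient descent on the in-domain image-text pairs).

```python
import random

def prepend_visual_prompts(patch_tokens, num_prompts=8, dim=4, seed=0):
    """Prepend soft visual prompt vectors to a patch-embedding sequence.

    patch_tokens: list of token vectors (each a list of floats) produced by
    the visual encoder's patch-embedding layer.
    Returns the prompt-augmented sequence that would be fed to the
    transformer blocks. Prompts here are randomly initialized for
    illustration; in practice they are learnable parameters.
    """
    rng = random.Random(seed)
    prompts = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
               for _ in range(num_prompts)]
    return prompts + patch_tokens

# e.g. a 14x14 ViT patch grid gives 196 patch tokens
patches = [[1.0] * 4 for _ in range(196)]
seq = prepend_visual_prompts(patches)
print(len(seq))  # 8 prompts + 196 patches = 204
```

Only the prompt vectors (and typically a small head) are trained while the encoder weights stay frozen, which is what makes the paradigm low-cost.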
Pages: 2615-2619 (5 pages)