WebVLN: Vision-and-Language Navigation on Websites

Citations: 0
Authors
Chen, Qi [1 ]
Pitawela, Dileepa [1 ]
Zhao, Chongyang [1 ]
Zhou, Gengze [1 ]
Chen, Hsiang-Ting [1 ]
Wu, Qi [1 ]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2 | 2024
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task, which attends only to vision and instructions (language), the WebVLN agent further considers underlying web-specific content such as HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called the Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.
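Since the abstract notes that a WebVLN agent also reads underlying HTML that is not visible on the rendered page, the following minimal Python sketch illustrates one way such textual and image metadata could be pulled from raw HTML as additional observations. It is only an assumption-laden illustration, not the authors' WebVLN-Net pipeline or dataset tooling; the class name PageContentExtractor and the texts/links/images fields are hypothetical. See https://github.com/WebVLN/WebVLN for the official code.

```python
# Illustrative sketch only: collect text, link targets, and image alt/src
# attributes from raw HTML, i.e. information a web-navigation agent could
# use even when it is not visible on the rendered page.
# NOT the authors' WebVLN-Net implementation.
from html.parser import HTMLParser


class PageContentExtractor(HTMLParser):
    """Gathers visible text, hyperlink targets, and image metadata."""

    def __init__(self):
        super().__init__()
        self.texts, self.links, self.images = [], [], []
        self._skip_depth = 0  # inside <script>/<style>, ignore text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images.append({"src": attrs.get("src"), "alt": attrs.get("alt")})

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.texts.append(data.strip())


if __name__ == "__main__":
    html = '<p>Blue cotton shirt</p><img src="shirt.jpg" alt="front view"><a href="/checkout">Buy</a>'
    parser = PageContentExtractor()
    parser.feed(html)
    print(parser.texts, parser.links, parser.images)
```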
Pages: 1165 - 1173
Number of pages: 9