Local Slot Attention for Vision-and-Language Navigation

Cited by: 1
Authors
Zhuang, Yifeng [1 ]
Sun, Qiang [1 ]
Fu, Yanwei [2 ]
Chen, Lifeng [1 ]
Xue, Xiangyang [1 ]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022 | 2022
Keywords
vision-and-language navigation; slot attention; local attention;
DOI
10.1145/3512527.3531366
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in unfamiliar environments by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, two problems remain in current transformer-based models. 1) The models process each view independently, without taking the integrity of objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit restriction; this kind of mixing may introduce extra noise rather than useful information. To address these issues, we propose 1) a slot-attention-based module to incorporate information from segmentations of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use Recurrent VLN-BERT as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
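The two modules lend themselves to short sketches. The first follows the generic slot attention of Locatello et al. (2020): a set of learned slots competes for per-view segment features through an attention softmax taken over the slot axis, followed by a GRU update. This is a minimal sketch under that assumption, not the authors' exact module; all names (SlotAttention, num_slots, iters) are illustrative.

import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention: slots compete for input features via a
    softmax over the slot axis, then are refined with a GRU update."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, n_features, dim), e.g. segment features of one view
        b, _, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_init.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # softmax over the slot axis: slots compete for each feature
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over features
            updates = attn @ v  # (batch, num_slots, dim)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(b, self.num_slots, d)
        return slots

The second sketch builds a local attention mask. The paper's exact neighborhood is not reproduced here; the sketch assumes the standard R2R panorama layout of 12 headings x 3 elevations and a hypothetical one-step angular window, so that each view may attend only to spatially adjacent views.

def local_view_mask(n_headings: int = 12, n_elevations: int = 3,
                    window: int = 1) -> torch.Tensor:
    """Boolean (n_views, n_views) mask: True where attention is allowed.
    View i = elevation * n_headings + heading; a view attends only to
    views within `window` steps in heading (circular) and elevation."""
    n = n_headings * n_elevations
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        ei, hi = divmod(i, n_headings)
        for j in range(n):
            ej, hj = divmod(j, n_headings)
            # circular distance in heading, linear distance in elevation
            dh = min((hi - hj) % n_headings, (hj - hi) % n_headings)
            if dh <= window and abs(ei - ej) <= window:
                mask[i, j] = True
    return mask

Applied before the softmax in the visual self-attention (scores.masked_fill(~mask, float('-inf'))), the mask removes attention between spatially distant views, which is the explicit restriction the abstract describes.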
Pages: 545 - 553
Page count: 9
Related References
30 in total
  • [1] Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
    Anderson, Peter
    Wu, Qi
    Teney, Damien
    Bruce, Jake
    Johnson, Mark
    Sunderhauf, Niko
    Reid, Ian
    Gould, Stephen
    van den Hengel, Anton
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 3674 - 3683
  • [2] TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
    Chen, Howard
    Suhr, Alane
    Misra, Dipendra
    Snavely, Noah
    Artzi, Yoav
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 12530 - 12539
  • [3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Devlin, Jacob
    Chang, Ming-Wei
    Lee, Kenton
    Toutanova, Kristina
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 4171 - 4186
  • [4] Speaker-Follower Models for Vision-and-Language Navigation
    Fried, Daniel
    Hu, Ronghang
    Cirik, Volkan
    Rohrbach, Anna
    Andreas, Jacob
    Morency, Louis-Philippe
    Berg-Kirkpatrick, Taylor
    Saenko, Kate
    Klein, Dan
    Darrell, Trevor
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NEURIPS 2018), 2018
  • [5] Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
    Hao, Weituo
    Li, Chunyuan
    Li, Xiujun
    Carin, Lawrence
    Gao, Jianfeng
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020
  • [6] Deep Residual Learning for Image Recognition
    He, Kaiming
    Zhang, Xiangyu
    Ren, Shaoqing
    Sun, Jian
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 770 - 778
  • [7] VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation
    Hong, Yicong
    Wu, Qi
    Qi, Yuankai
    Rodriguez-Opazo, Cristian
    Gould, Stephen
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 1643 - 1653
  • [8] Transferable Representation Learning in Vision-and-Language Navigation
    Huang, Haoshuo
    Jain, Vihan
    Mehta, Harsh
    Ku, Alexander
    Magalhaes, Gabriel
    Baldridge, Jason
    Ie, Eugene
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7403 - 7412
  • [9] Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
    Ke, Liyiming
    Li, Xiujun
    Bisk, Yonatan
    Holtzman, Ari
    Gan, Zhe
    Liu, Jingjing
    Gao, Jianfeng
    Choi, Yejin
    Srinivasa, Siddhartha
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6734 - 6742
  • [10] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
    Krantz, Jacob
    Wijmans, Erik
    Majumdar, Arjun
    Batra, Dhruv
    Lee, Stefan
    [J]. COMPUTER VISION - ECCV 2020, 2020