TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

Cited by: 76
Authors
Zhou, Yiyi [1 ,2 ]
Ren, Tianhe [1 ,2 ]
Zhu, Chaoyang [1 ,2 ]
Sun, Xiaoshuai [1 ,2 ]
Liu, Jianzhuang [3 ]
Ding, Xinghao [2 ]
Xu, Mingliang [4 ]
Ji, Rongrong [1 ,2 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Huawei Technol, Noahs Ark Lab, Shenzhen, Peoples R China
[4] Zhengzhou Univ, Zhengzhou, Peoples R China
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
DOI
10.1109/ICCV48922.2021.00208
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Owing to their superior ability in global dependency modeling, the Transformer and its variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), the multimodal prediction often requires visual information from macro- to micro-views. Therefore, how to dynamically schedule the global and local dependency modeling in the Transformer has become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue. Specifically, in TRAR, each visual Transformer layer is equipped with a routing module with different attention spans. The model can dynamically select the corresponding attentions based on the output of the previous inference step, so as to formulate the optimal routing path for each example. Notably, with careful design, TRAR reduces the additional computation and memory overhead to a nearly negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets of VQA and REC, achieving notable performance gains over standard Transformers and a number of state-of-the-art methods.
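The routing scheme described in the abstract can be illustrated with a rough sketch: a lightweight router predicts soft weights over candidate attention spans (local windows up to global), and the masked attention outputs are mixed accordingly. The router design here (mean-pooled features through a linear projection), the span choices, and all function names are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def span_mask(n, span):
    # Binary mask letting each query attend within a local window of the
    # given span; span >= n - 1 reduces to global attention.
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= span).astype(float)

def routed_attention(x, spans, w_route):
    """Hypothetical sketch of example-dependent span routing.

    x        : (n, d) token features of one example
    spans    : candidate attention spans for this layer
    w_route  : (d, len(spans)) router projection (an assumed design)
    """
    n, d = x.shape
    # Router: mean-pooled features -> soft probabilities over spans.
    alpha = softmax(x.mean(axis=0) @ w_route)
    # Single-head self-attention scores, shared across all paths so the
    # extra overhead is only the per-span masking and mixing.
    scores = (x @ x.T) / np.sqrt(d)
    out = np.zeros_like(x)
    for a, s in zip(alpha, spans):
        mask = span_mask(n, s)
        att = softmax(np.where(mask > 0, scores, -1e9))
        out += a * (att @ x)  # mix the span-restricted attention outputs
    return out, alpha
```

Because the attention weights are shared and only the masks differ, the routing adds little beyond the mixing step, which is consistent with the abstract's claim of nearly negligible overhead.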
Pages: 2054-2064
Page count: 11