Local self-attention in transformer for visual question answering

Cited by: 33

Authors
Shen, Xiang [1 ]
Han, Dezhi [1 ]
Guo, Zihan [1 ]
Chen, Chongqing [1 ]
Hua, Jie [2 ]
Luo, Gaofeng [3 ]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, 1550 Haigang Ave, Shanghai 201306, Peoples R China
[2] Univ Technol Sydney, TD Sch, Ultimo, NSW 2007, Australia
[3] Shaoyang Univ, Coll Informat Engn, Shaoyang 422099, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai;
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering;
DOI
10.1007/s10489-022-04355-w
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture because its self-attention mechanism excels at modeling global dependencies. However, balancing global and local dependency modeling in the traditional Transformer architecture remains an open problem: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. This paper therefore proposes Local Self-Attention in Transformer (LSAT), a novel model for visual question answering that addresses this issue. By setting local windows over the visual features, LSAT models intra-window and inter-window attention simultaneously, capturing rich contextual information while avoiding the redundant information of global self-attention. Extensive experiments and ablation studies with grid visual features on the VQA benchmark datasets VQA 2.0 and CLEVR show that, with an appropriate local window size, LSAT outperforms the baseline model on all metrics; its best test accuracies with grid visual features are 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at
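
The window mechanism described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes intra-window attention is standard multi-head self-attention inside non-overlapping windows over the feature grid, and that inter-window attention operates on mean-pooled window summaries. The class name LocalWindowAttention and the default feature dimension, head count, and window size are illustrative assumptions.

    # Illustrative sketch only; window partitioning and summary-based
    # inter-window exchange are assumptions, not the paper's exact method.
    import torch
    import torch.nn as nn

    class LocalWindowAttention(nn.Module):
        def __init__(self, dim=512, heads=8, window=4):
            super().__init__()
            self.window = window
            self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            # x: (B, H, W, C) grid visual features; H and W must be
            # divisible by the window size.
            B, H, W, C = x.shape
            w = self.window
            # Partition the grid into non-overlapping w x w windows:
            # (B, H, W, C) -> (B * nWindows, w*w, C).
            win = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
            win = win.reshape(-1, w * w, C)
            # Intra-window attention: each token attends only within its window.
            win, _ = self.intra(win, win, win)
            # Inter-window attention: mean-pool each window into one summary
            # token and let the summaries attend to one another across windows.
            summ = win.mean(dim=1).view(B, -1, C)        # (B, nWindows, C)
            summ, _ = self.inter(summ, summ, summ)
            # Broadcast each refined summary back into its window.
            win = win + summ.reshape(-1, 1, C)
            # Restore the original (B, H, W, C) grid layout.
            win = win.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
            return win.reshape(B, H, W, C)

    # Example: a 12x12 grid of 512-d features, a typical backbone output shape.
    feats = torch.randn(2, 12, 12, 512)
    print(LocalWindowAttention()(feats).shape)   # torch.Size([2, 12, 12, 512])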
Pages: 16706-16723
Page count: 18