Local self-attention in transformer for visual question answering

Cited by: 0
Authors
Xiang Shen
Dezhi Han
Zihan Guo
Chongqing Chen
Jie Hua
Gaofeng Luo
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] University of Technology, TD School
[3] Shaoyang University, College of Information Engineering
Source
Applied Intelligence | 2023 / Volume 53
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering
DOI
Not available
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture for its strength in modeling global dependencies with self-attention. However, balancing global and local dependency modeling in the standard Transformer remains an open issue: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. This paper therefore proposes Local Self-Attention in Transformer (LSAT), a novel VQA model that addresses this issue. By partitioning visual features into local windows, LSAT models intra-window and inter-window attention simultaneously, capturing rich contextual information while avoiding the redundant information of global self-attention. Extensive experiments and ablation studies with grid visual features on the VQA benchmark datasets VQA 2.0 and CLEVR show that, with an appropriate local window size, LSAT outperforms the baseline models on all metrics. Specifically, the best test accuracies of LSAT with grid visual features on VQA 2.0 and CLEVR are 71.94% and 98.72%, respectively. Source code is available at https://github.com/shenxiang-vqa/LSAT.
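To illustrate the mechanism sketched in the abstract (intra-window attention followed by inter-window attention over windowed grid features), the short PyTorch example below partitions a flattened feature grid into non-overlapping local windows, attends within each window, and then attends across per-window summaries. This is a minimal sketch, not the authors' implementation (see the linked repository for that); the module name LocalWindowAttention, the use of mean-pooled window tokens for inter-window attention, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of window-based self-attention over grid visual features.
# Assumptions (not from the paper): pooled window tokens for inter-window
# attention, additive fusion, and the specific dimensions used below.
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, window=4):
        super().__init__()
        self.window = window
        # Intra-window attention: full attention restricted to each local window.
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-window attention: attention over one pooled token per window.
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, H*W, dim) flattened grid visual features, H == W assumed.
        b, n, d = x.shape
        side = int(n ** 0.5)
        w = self.window
        assert side % w == 0, "grid side must be divisible by the window size"

        # Partition the side x side grid into (side//w)^2 windows of w*w tokens.
        x = x.view(b, side // w, w, side // w, w, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b * (side // w) ** 2, w * w, d)

        # Intra-window self-attention captures local context inside each window.
        x, _ = self.intra(x, x, x)

        # Mean-pool each window to one token and attend across windows,
        # so information can still flow between distant image regions.
        num_win = (side // w) ** 2
        win_tokens = x.mean(dim=1).view(b, num_win, d)
        win_tokens, _ = self.inter(win_tokens, win_tokens, win_tokens)

        # Broadcast the refined window summaries back to their member tokens.
        x = x.view(b, num_win, w * w, d) + win_tokens.unsqueeze(2)

        # Restore the original (batch, H*W, dim) token order.
        x = x.view(b, side // w, side // w, w, w, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, n, d)
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 64, 512)      # e.g. an 8x8 grid of 512-d features
    out = LocalWindowAttention()(feats)  # -> torch.Size([2, 64, 512])
    print(out.shape)
```

The window size in such a scheme trades local detail against the number of window-level tokens, which mirrors the abstract's note that results depend on selecting an appropriate local window size.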
Pages: 16706-16723
Number of pages: 17