Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering

Cited by: 1
Authors
Zhang, Haotian [1 ]
Wu, Wei [1 ]
Affiliations
[1] Inner Mongolia Univ, Comp Sci Dept, Hohhot, Peoples R China
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; transformer; multi-modal task; cross-modal gate fusion;
DOI
10.1109/IJCNN55064.2022.9891887
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Given an image and an open-ended question about that image, a Visual Question Answering (VQA) model aims to produce the correct answer. This is a challenging task that requires a fine-grained, simultaneous understanding of the visual content of the image and the textual content of the question. However, most current models ignore region-to-word interactions and the noise introduced by irrelevant words. This paper proposes a novel model, the Transformer Gate Attention Model (TGAM), which captures inter-modal information dependence to address these problems. Specifically, TGAM is composed of an Adaptive Gate (AG) and a Parallel Transformer Module (PTM): AG fuses information between the two modalities and reduces noise, while PTM produces higher-level cross-modal feature representations. Extensive qualitative and quantitative experiments were conducted on the VQA-v2 dataset to verify the effectiveness of TGAM, and ablation studies were performed to explore the reasons behind its effectiveness. Experimental results show that TGAM significantly outperforms previous state-of-the-art methods. Our best model achieves 71.28% overall accuracy on the test-dev set and 71.6% on the test-std set.
Pages: 7
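The abstract's description of AG (per-token gating that suppresses irrelevant words before fusion) and PTM (parallel transformer encoding of the gated visual and textual streams) suggests the rough structure sketched below. This is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation: the module names, layer counts, feature dimension (512), region/word counts, and the 3129-way answer classifier (the commonly used VQA-v2 candidate-answer vocabulary size) are all assumptions made for illustration.

# Illustrative sketch only: gated cross-modal fusion followed by parallel
# transformer encoding, loosely inspired by the AG + PTM description in the
# abstract. All names, dimensions, and wiring are assumptions, not the
# authors' released code.
import torch
import torch.nn as nn


class AdaptiveGateFusion(nn.Module):
    """Gate one modality's tokens with a signal computed from the other,
    down-weighting tokens (e.g., irrelevant question words) before fusion."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, n_tokens, dim)  features of one modality
        # context: (batch, dim)            pooled summary of the other modality
        ctx = context.unsqueeze(1).expand_as(x)       # broadcast over tokens
        g = self.gate(torch.cat([x, ctx], dim=-1))    # per-token gate in [0, 1]
        return g * x                                  # noisy tokens are suppressed


class ParallelTransformerFusion(nn.Module):
    """Encode each gated modality with its own transformer in parallel,
    then fuse the pooled outputs for answer classification."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_answers: int = 3129):
        super().__init__()
        self.img_gate = AdaptiveGateFusion(dim)
        self.txt_gate = AdaptiveGateFusion(dim)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.img_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.txt_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, n_regions, dim), txt_feats: (batch, n_words, dim)
        img_summary = img_feats.mean(dim=1)
        txt_summary = txt_feats.mean(dim=1)
        img = self.img_encoder(self.img_gate(img_feats, txt_summary))
        txt = self.txt_encoder(self.txt_gate(txt_feats, img_summary))
        fused = torch.cat([img.mean(dim=1), txt.mean(dim=1)], dim=-1)
        return self.classifier(fused)                 # answer logits


if __name__ == "__main__":
    model = ParallelTransformerFusion()
    # 36 region features and 14 word features per sample are typical choices
    # in VQA pipelines; here they are arbitrary.
    logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
    print(logits.shape)  # torch.Size([2, 3129])

In this sketch the gate for each modality is conditioned on a mean-pooled summary of the other modality; the paper's actual AG and PTM designs may differ in how the gating signal and the cross-modal interaction are computed.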