Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering

Cited by: 1
Authors
Zhang, Haotian [1 ]
Wu, Wei [1 ]
Affiliations
[1] Inner Mongolia Univ, Comp Sci Dept, Hohhot, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; transformer; multi-modal task; cross-modal gate fusion;
DOI
10.1109/IJCNN55064.2022.9891887
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given an image and an open-ended question about it, a Visual Question Answering (VQA) model aims to produce the correct answer. This is a challenging task that requires a fine-grained, simultaneous understanding of the visual content of the image and the textual content of the question. However, most current models ignore region-to-word interactions and the noise introduced by irrelevant words. This paper proposes a novel model, the Transformer Gate Attention Model (TGAM), which captures inter-modal information dependence to address these problems. Specifically, TGAM consists of an Adaptive Gate (AG) and a Parallel Transformer Module (PTM): the AG fuses information across modalities while suppressing noise, and the PTM produces higher-level cross-modal feature representations. Extensive qualitative and quantitative experiments on the VQA-v2 dataset verify the effectiveness of TGAM, and ablation studies explore the reasons behind it. Experimental results show that TGAM significantly outperforms previous state-of-the-art methods. Our best model achieves 71.28% overall accuracy on the test-dev set and 71.6% on the test-std set.
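The abstract does not give the exact formulation of the Adaptive Gate; as a rough illustration only (not the authors' design), cross-modal gate fusion is commonly realized by computing a sigmoid gate from both modalities and taking an element-wise convex combination of the visual and textual features. The function name and the weight shapes below are assumptions for the sketch:

```python
import numpy as np

def cross_modal_gate(visual, textual, W, b):
    """Hypothetical adaptive-gate fusion: a sigmoid gate computed from
    both modalities decides, per dimension, how much of each feature
    to keep. The gate lies in (0, 1), so the result is an element-wise
    convex combination of the two inputs."""
    z = np.concatenate([visual, textual], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-z))            # sigmoid gate in (0, 1)
    return g * visual + (1.0 - g) * textual

# Toy usage with random features of dimension 4.
rng = np.random.default_rng(0)
d = 4
v = rng.standard_normal(d)                  # visual region feature
t = rng.standard_normal(d)                  # textual word feature
W = rng.standard_normal((2 * d, d))
b = np.zeros(d)
fused = cross_modal_gate(v, t, W, b)
print(fused.shape)                          # (4,)
```

Because the gate is bounded in (0, 1), each fused dimension stays between the corresponding visual and textual values, which is what lets such a gate attenuate noisy (irrelevant) word features rather than pass them through unchanged.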
Pages: 7