Multi-Modality Global Fusion Attention Network for Visual Question Answering

Cited: 2
Authors
Yang, Cheng [1 ]
Wu, Weijia [1 ]
Wang, Yuxing [1 ]
Zhou, Hong [1 ]
Affiliations
[1] Zhejiang Univ, Engn Minist, Key Lab Biomed, Hangzhou 310027, Peoples R China
Keywords
visual question answering; global attention mechanism; deep learning
DOI
10.3390/electronics9111882
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Visual question answering (VQA) requires a high-level understanding of both questions and images, along with visual reasoning to predict the correct answer. It is therefore important to design an effective attention model that associates key regions in an image with key words in a question. To date, most attention-based approaches model only the relationships between individual image regions and question words. This is insufficient for predicting the correct answer, because humans reason over global information, not just local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which capture information from a global perspective. Our method computes co-attention and self-attention jointly rather than separately. We validate the proposed method on the widely used VQA-v2 benchmark. Experimental results show that it outperforms the previous state of the art; our best single model achieves 70.67% accuracy on the VQA-v2 test-dev set.
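The abstract's key idea, computing self-attention and co-attention in one pass, can be illustrated by concatenating the image-region features and question-word features into a single sequence before applying scaled dot-product attention: intra-modality (self) and cross-modality (co) relationships then fall out of the same attention matrix. This is a minimal numpy sketch under that assumption; it is not the paper's exact GFA block, and the function name and shapes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_fusion_attention(img_feats, q_feats):
    """Illustrative joint attention (hypothetical simplification of GFA).

    img_feats: (m, d) image-region features
    q_feats:   (n, d) question-word features
    Concatenating both modalities before attention means the (m+n, m+n)
    attention matrix contains self-attention (image-image, word-word)
    and co-attention (image-word) blocks, computed at the same time.
    """
    fused = np.concatenate([img_feats, q_feats], axis=0)   # (m+n, d)
    d = fused.shape[-1]
    scores = fused @ fused.T / np.sqrt(d)                  # (m+n, m+n)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    return attn @ fused                                    # (m+n, d)

# toy example: 3 image regions, 2 question words, 4-dim features
rng = np.random.default_rng(0)
out = joint_fusion_attention(rng.normal(size=(3, 4)), rng.normal(size=(2, 4)))
print(out.shape)  # (5, 4)
```

A real implementation would add learned query/key/value projections, multiple heads, and stacking, per the MGFAN description; the point here is only that one fused attention map subsumes both attention types.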
Pages: 1-12
Page count: 12