Multi-Modality Global Fusion Attention Network for Visual Question Answering

Cited by: 2
Authors
Yang, Cheng [1 ]
Wu, Weijia [1 ]
Wang, Yuxing [1 ]
Zhou, Hong [1 ]
Affiliations
[1] Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou 310027, China
Keywords
visual question answering; global attention mechanism; deep learning
DOI
10.3390/electronics9111882
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Visual question answering (VQA) requires a high-level understanding of both the question and the image, along with visual reasoning, to predict the correct answer. It is therefore important to design an effective attention model that associates key regions in the image with key words in the question. Most existing attention-based approaches model only the relationships between individual image regions and individual question words. This is insufficient for VQA, because humans reason with global information, not only local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which capture information from a global perspective. Our method computes co-attention and self-attention jointly, rather than computing them separately. We validate the proposed method on one of the most widely used benchmarks, the VQA-v2 dataset. Experimental results show that it outperforms the previous state of the art; our best single model achieves 70.67% accuracy on the VQA-v2 test-dev set.
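The abstract describes the mechanism only at a high level, so the sketch below is an illustration rather than the authors' implementation: image-region and question-word features are concatenated into one global sequence, and a single multi-head attention pass over that sequence yields self-attention (within a modality) and co-attention (across modalities) at the same time, with blocks stacked as the abstract describes. All names and hyperparameters here (GlobalFusionAttention, dim, heads, the feed-forward sizing) are assumptions for illustration.

```python
# A minimal PyTorch sketch (not the authors' code) of one hypothetical
# "global fusion attention" (GFA) block: both modalities are fused into
# a single sequence before attention, so intra-modality (self) and
# inter-modality (co) interactions are computed in one global pass.
import torch
import torch.nn as nn

class GlobalFusionAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, n_regions, dim); q_feats: (B, n_words, dim)
        fused = torch.cat([img_feats, q_feats], dim=1)  # global sequence
        # One attention pass over the fused sequence: every region/word
        # attends to every other region/word, within and across modalities.
        attn_out, _ = self.attn(fused, fused, fused)
        fused = self.norm1(fused + attn_out)
        fused = self.norm2(fused + self.ffn(fused))
        n = img_feats.size(1)
        return fused[:, :n], fused[:, n:]  # split back per modality

# Stacking GFA blocks, as the abstract describes:
blocks = nn.ModuleList([GlobalFusionAttention() for _ in range(6)])
img, q = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
for blk in blocks:
    img, q = blk(img, q)
```

Under these assumptions, fusing before attending is what distinguishes the global formulation from pipelines that run self-attention per modality and co-attention as a separate step.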
Pages: 1-12 (12 pages)
Related Papers (50 records)
  • [1] Multi-modality Latent Interaction Network for Visual Question Answering. Gao, Peng; You, Haoxuan; Zhang, Zhanpeng; Wang, Xiaogang; Li, Hongsheng. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 5824-5834.
  • [2] Adaptive Attention Fusion Network for Visual Question Answering. Gu, Geonmo; Kim, Seong Tae; Ro, Yong Man. 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017: 997-1002.
  • [3] MDAnet: Multiple Fusion Network with Double Attention for Visual Question Answering. Feng, Junyi; Gong, Ping; Qiu, Guanghui. ICVIP 2019: Proceedings of the 2019 3rd International Conference on Video and Image Processing, 2019: 143-147.
  • [4] Feature Fusion Attention Visual Question Answering. Wang, Chunlin; Sun, Jianyong; Chen, Xiaolin. ICMLC 2019: 2019 11th International Conference on Machine Learning and Computing, 2019: 412-416.
  • [5] Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering. Cao, Jianjian; Qin, Xiameng; Zhao, Sanyuan; Shen, Jianbing. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [6] Dynamic Fusion with Intra- and Inter-modality Attention Flow for Visual Question Answering. Gao, Peng; Jiang, Zhengkai; You, Haoxuan; Lu, Pan; Hoi, Steven; Wang, Xiaogang; Li, Hongsheng. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 6632-6641.
  • [7] The Multi-modal Fusion in Visual Question Answering: A Review of Attention Mechanisms. Lu, Siyu; Liu, Mingzhe; Yin, Lirong; Yin, Zhengtong; Liu, Xuan; Zheng, Wenfeng. PeerJ Computer Science, 2023, 9.
  • [8] A Multi-modality Fusion Network Based on Attention Mechanism for Brain Tumor Segmentation. Zhou, Tongxue; Ruan, Su; Guo, Yu; Canu, Stephane. 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI 2020), 2020: 377-380.
  • [9] Multi-modality Fusion Network for Action Recognition. Huang, Kai; Qin, Zheng; Xu, Kaiping; Ye, Shuxiong; Wang, Guolong. Advances in Multimedia Information Processing - PCM 2017, Part II, 2018, 10736: 139-149.
  • [10] Multi-Channel Co-Attention Network for Visual Question Answering. Tian, Weidong; He, Bin; Wang, Nanxun; Zhao, Zhongqiu. 2020 International Joint Conference on Neural Networks (IJCNN), 2020.