Multi-Modality Global Fusion Attention Network for Visual Question Answering

Cited by: 2
Authors
Yang, Cheng [1 ]
Wu, Weijia [1 ]
Wang, Yuxing [1 ]
Zhou, Hong [1 ]
Affiliations
[1] Zhejiang Univ, Minist Educ, Key Lab Biomed Engn, Hangzhou 310027, Peoples R China
Keywords
visual question answering; global attention mechanism; deep learning
DOI
10.3390/electronics9111882
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Visual question answering (VQA) requires a high-level understanding of both the question and the image, along with visual reasoning, to predict the correct answer. It is therefore important to design an effective attention model that associates key regions in an image with key words in a question. To date, most attention-based approaches have modeled only the relationships between individual image regions and individual question words. This is insufficient for VQA, because humans reason over global information, not only local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN), consisting of stacked global fusion attention (GFA) blocks, which captures information from a global perspective. The proposed method computes co-attention and self-attention at the same time, rather than computing them separately. We validate the method on the most commonly used benchmark, the VQA-v2 dataset. Experimental results show that it outperforms the previous state of the art; our best single model achieves 70.67% accuracy on the VQA-v2 test-dev set.
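This record omits the paper's equations, so the following is a minimal PyTorch sketch of the mechanism the abstract describes: a single attention pass over the concatenated region and word sequences, in which intra-modality token pairs realize self-attention and cross-modality pairs realize co-attention simultaneously. The class name GlobalFusionAttention, the feature dimensions, and the transformer-style residual/feed-forward structure are illustrative assumptions, not the paper's exact GFA formulation.

```python
import torch
import torch.nn as nn

class GlobalFusionAttention(nn.Module):
    """Hypothetical sketch of one GFA-style block (not the paper's code).

    Attending over the concatenated region/word sequence makes
    intra-modality pairs act as self-attention and cross-modality
    pairs act as co-attention, computed jointly in one pass.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Linear(4 * dim, dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, qst: torch.Tensor):
        # img: (B, R, dim) image-region features
        # qst: (B, W, dim) question-word features
        fused = torch.cat([img, qst], dim=1)          # (B, R + W, dim)
        out, _ = self.attn(fused, fused, fused)       # joint self/co-attention
        fused = self.norm1(fused + out)               # residual + layer norm
        fused = self.norm2(fused + self.ffn(fused))   # position-wise FFN
        num_regions = img.size(1)
        return fused[:, :num_regions], fused[:, num_regions:]

# "Stacked GFA blocks", per the abstract: chain several blocks.
blocks = nn.ModuleList(GlobalFusionAttention(512) for _ in range(6))
img = torch.randn(2, 36, 512)   # e.g., 36 bottom-up region features
qst = torch.randn(2, 14, 512)   # e.g., 14 question-word features
for blk in blocks:
    img, qst = blk(img, qst)
```

Fusing both modalities into a single attention pass is one plausible reading of "computes co-attention and self-attention at the same time"; the paper's actual block may weight or gate the two modalities differently.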
Pages: 1 - 12
Number of pages: 12
Related papers (50 records)
  • [31] Word-to-region attention network for visual question answering
    Peng, Liang
    Yang, Yang
    Bin, Yi
    Xie, Ning
    Shen, Fumin
    Ji, Yanli
    Xu, Xing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (03) : 3843 - 3858
  • [32] Dynamic Co-attention Network for Visual Question Answering
    Ebaid, Doaa B.
    Madbouly, Magda M.
    El-Zoghabi, Adel A.
    2021 8TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2021), 2021, : 125 - 129
  • [33] Multi-stage Attention based Visual Question Answering
    Mishra, Aakansha
    Anand, Ashish
    Guha, Prithwijit
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9407 - 9414
  • [34] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4187 - 4195
  • [35] SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)
    Naik, Atharva
    Butala, Yash Parag
    Vaikunthan, Navaneethan
    Kapoor, Raghav
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23592 - 23593
  • [36] Multi-modality relation attention network for breast tumor classification
    Yang, Xiao
    Xi, Xiaoming
    Yang, Lu
    Xu, Chuanzhen
    Song, Zuoyong
    Nie, Xiushan
    Qiao, Lishan
    Li, Chenglong
    Shi, Qinglei
    Yin, Yilong
    COMPUTERS IN BIOLOGY AND MEDICINE, 2022, 150
  • [37] Multi-view Attention Networks for Visual Question Answering
    Li, Min
    Bai, Zongwen
    Deng, Jie
    2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024, 2024, : 788 - 794
  • [38] Multi-Modality Cross Attention Network for Image and Sentence Matching
    Wei, Xi
    Zhang, Tianzhu
    Li, Yan
    Zhang, Yongdong
    Wu, Feng
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 10938 - 10947
  • [39] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
INFORMATION FUSION, 2020, 55 : 116 - 126
  • [40] Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering
    Manmadhan, Sruthy
    Kovoor, Binsu C.
    IMAGE AND VISION COMPUTING, 2021, 115