Multi-Modality Global Fusion Attention Network for Visual Question Answering

Cited: 2
Authors
Yang, Cheng [1 ]
Wu, Weijia [1 ]
Wang, Yuxing [1 ]
Zhou, Hong [1 ]
Affiliations
[1] Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou 310027, People's Republic of China
Keywords
visual question answering; global attention mechanism; deep learning
DOI
10.3390/electronics9111882
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812 (Computer Science and Technology)
Abstract
Visual question answering (VQA) requires a high-level understanding of both the question and the image, along with visual reasoning, to predict the correct answer. It is therefore important to design an effective attention model that associates key regions in the image with key words in the question. To date, most attention-based approaches model only the relationships between individual image regions and individual question words. This is insufficient for predicting correct answers in VQA, because humans reason over global information, not just local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which capture information from a global perspective. The proposed method computes co-attention and self-attention simultaneously, rather than computing them separately. We validate the method on the most commonly used benchmark, the VQA-v2 dataset. Experimental results show that it outperforms the previous state of the art: our best single model achieves 70.67% accuracy on the VQA-v2 test-dev set.
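The abstract describes GFA blocks that compute self-attention (within a modality) and co-attention (across modalities) in a single step over image regions and question words. Below is a minimal PyTorch sketch of one plausible reading of such a block: the two modalities are fused into one joint sequence so that a single attention pass covers both intra- and inter-modality relationships at once. The class name GFABlock, the dimensions, and the fuse-then-split design are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class GFABlock(nn.Module):
    """Sketch of a global fusion attention (GFA) block (hypothetical).

    Image-region and question-word features are concatenated into one
    joint sequence, so a single multi-head attention pass covers both
    self-attention (within a modality) and co-attention (across
    modalities) at the same time, rather than computing them separately.
    """

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_feats, ques_feats):
        # img_feats:  (batch, num_regions, dim), e.g. object-detector features
        # ques_feats: (batch, num_words,   dim), e.g. contextual word embeddings
        n_img = img_feats.size(1)
        x = torch.cat([img_feats, ques_feats], dim=1)  # fused joint sequence
        # one attention pass over the fused sequence = self- + co-attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        # split back into the two modalities for the next stacked block
        return x[:, :n_img], x[:, n_img:]

# Stacking blocks, mirroring how MGFAN stacks GFA blocks (depth is a guess).
blocks = nn.ModuleList([GFABlock() for _ in range(4)])
img = torch.randn(2, 36, 512)   # 36 regions per image (a common choice)
ques = torch.randn(2, 14, 512)  # 14 tokens per question
for blk in blocks:
    img, ques = blk(img, ques)
```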
Pages
1-12 (12 pages)
Related Papers
50 in total (items [21]-[30] shown)
  • [21] Multi-Granularity Relational Attention Network for Audio-Visual Question Answering. Li, Linjun; Jin, Tao; Lin, Wang; Jiang, Hao; Pan, Wenwen; Wang, Jian; Xiao, Shuwen; Xia, Yan; Jiang, Weihao; Zhao, Zhou. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(08): 7080-7094.
  • [22] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering. Lao, Mingrui; Guo, Yanming; Wang, Hui; Zhang, Xin. IEEE Access, 2018, 6: 31516-31524.
  • [23] Cross-modality co-attention networks for visual question answering. Han, Dezhi; Zhou, Shuli; Li, Kuan Ching; de Mello, Rodrigo Fernandes. Soft Computing, 2021, 25(07): 5411-5421.
  • [25] Local relation network with multilevel attention for visual question answering. Sun, Bo; Yao, Zeng; Zhang, Yinghui; Yu, Lejun. Journal of Visual Communication and Image Representation, 2020, 73.
  • [26] Latent Attention Network With Position Perception for Visual Question Answering. Zhang, Jing; Liu, Xiaoqiang; Wang, Zhe. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(03): 5059-5069.
  • [27] Deep Attention Neural Tensor Network for Visual Question Answering. Bai, Yalong; Fu, Jianlong; Zhao, Tiejun; Mei, Tao. Computer Vision - ECCV 2018, Part XII, 2018, 11216: 21-37.
  • [28] Word-to-region attention network for visual question answering. Peng, Liang; Yang, Yang; Bin, Yi; Xie, Ning; Shen, Fumin; Ji, Yanli; Xu, Xing. Multimedia Tools and Applications, 2019, 78: 3843-3858.
  • [30] Deep Modular Bilinear Attention Network for Visual Question Answering. Yan, Feng; Silamu, Wushouer; Li, Yanbing. Sensors, 2022, 22(03).