Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Cited: 26
Authors
Cao, Jianjian [1]
Qin, Xiameng [2]
Zhao, Sanyuan [1]
Shen, Jianbing [3]
Affiliations
[1] Beijing Institute of Technology, Department of Computer Science, Beijing 100081, China
[2] Baidu Inc., Beijing 100193, China
[3] University of Macau, Department of Computer and Information Science, State Key Laboratory of Internet of Things for Smart City, Macau, China
Funding
National Natural Science Foundation of China
Keywords
relational reasoning; visual question answering (VQA); graph matching attention (GMA); networks
DOI
10.1109/TNNLS.2021.3135655
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Answering semantically complicated questions about an image is challenging in visual question answering (VQA). Although the image can be well represented by deep learning, the question is often embedded in a simple way that fails to capture its full meaning. Moreover, because visual and textual features come from different modalities, a gap exists between them, making it difficult to align and exploit cross-modality information. In this article, we address these two problems and propose a graph matching attention (GMA) network. First, it builds a graph not only for the image but also for the question, using both syntactic and embedding information. Next, we explore the intramodality relationships with a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then passed to the answer prediction module to produce the final answer. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA and VQA 2.0 datasets. Ablation studies verify the effectiveness of each module in the GMA network.
Pages: 4160-4171
Page count: 12
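
The abstract above outlines the GMA pipeline (modality-specific graphs, a dual-stage graph encoder, bilateral cross-modality matching attention, and answer prediction). As a rough orientation only, here is a minimal PyTorch sketch of what the bilateral matching step between visual and question graph nodes can look like; the class name, the single linear projection per modality, the softmax directions, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralGraphMatchingAttention(nn.Module):
    """Illustrative sketch: match visual and question graph nodes bilaterally.

    Computes a node-to-node affinity matrix between the two graphs, then
    updates each side with features attended from the other modality.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projects visual node features
        self.proj_q = nn.Linear(dim, dim)  # projects question node features
        self.scale = dim ** -0.5           # standard dot-product scaling

    def forward(self, v_nodes: torch.Tensor, q_nodes: torch.Tensor):
        # v_nodes: (B, Nv, d) visual graph nodes; q_nodes: (B, Nq, d) question graph nodes
        affinity = torch.bmm(
            self.proj_v(v_nodes), self.proj_q(q_nodes).transpose(1, 2)
        ) * self.scale                                         # (B, Nv, Nq)
        attn_v2q = F.softmax(affinity, dim=2)                  # visual nodes attend over question nodes
        attn_q2v = F.softmax(affinity, dim=1).transpose(1, 2)  # question nodes attend over visual nodes
        v_updated = v_nodes + torch.bmm(attn_v2q, q_nodes)    # (B, Nv, d), residual fusion (assumption)
        q_updated = q_nodes + torch.bmm(attn_q2v, v_nodes)    # (B, Nq, d)
        return v_updated, q_updated

# Hypothetical usage with typical VQA shapes (36 region features, 14 question tokens):
gma = BilateralGraphMatchingAttention(dim=512)
v_out, q_out = gma(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
```

In the paper, this matching step sits between the dual-stage graph encoder and the answer-prediction module described in the abstract; the fused visual and question features would then feed the prediction head.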