DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and Explanation Generation

Cited by: 38
Authors
Zhang, Weifeng [1 ]
Yu, Jing [2 ]
Zhao, Wenhong [3 ]
Ran, Chuan [4 ]
Affiliations
[1] Jiaxing Univ, Jiaxing, Zhejiang, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Jiaxing Univ, Nanhu Coll, Jiaxing, Zhejiang, Peoples R China
[4] IBM Corp, Res Triangle Pk, NC 27709 USA
Keywords
Multimodal reasoning and fusion; Visual Question Answering; Explainable artificial intelligence; Attention
DOI
10.1016/j.inffus.2021.02.006
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA), which aims to answer natural-language questions according to the content of an image, has attracted extensive attention from the artificial intelligence community. Multimodal reasoning and fusion is a central component of recent VQA models. However, most existing VQA models remain insufficient at reasoning over and fusing clues from multiple modalities. Furthermore, they lack interpretability, since they disregard explanations. We argue that reasoning over and fusing the multiple relations implied across modalities contributes to more accurate answers and explanations. In this paper, we design an effective multimodal reasoning and fusion model to achieve fine-grained multimodal reasoning and fusion. Specifically, we propose the Multi-Graph Reasoning and Fusion (MGRF) layer, which adopts pre-trained semantic relation embeddings to reason over complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. MGRF layers can be further stacked in depth to form the Deep Multimodal Reasoning and Fusion Network (DMRFNet), which sufficiently reasons over and fuses multimodal relations. Furthermore, an explanation generation module is designed to justify the predicted answer. This justification reveals the motive behind the model's decision and enhances the model's interpretability. Quantitative and qualitative experimental results on the VQA 2.0 and VQA-E datasets show DMRFNet's effectiveness.
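The abstract describes the MGRF layer as attending over two relation graphs (spatial and semantic) between visual objects and fusing the two branches adaptively. The following is a minimal NumPy sketch of that idea only, not the paper's implementation: the relation matrices, the single-head attention, the sigmoid fusion gate, and all shapes and names (`mgrf_layer`, `gate_w`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(feats, rel_bias):
    """Scaled dot-product attention over object features,
    biased by a pairwise relation matrix (one 'graph')."""
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d) + rel_bias
    return softmax(scores, axis=-1) @ feats

def mgrf_layer(feats, spatial_rel, semantic_rel, gate_w):
    """One MGRF-style layer (sketch): reason over the spatial and
    semantic relation graphs separately, then fuse the two branches
    with a learned per-object gate. Stacking such layers in depth
    would correspond to the deep network described in the abstract."""
    h_spa = graph_attention(feats, spatial_rel)   # spatial branch
    h_sem = graph_attention(feats, semantic_rel)  # semantic branch
    # sigmoid gate from the concatenated branches decides the mix
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([h_spa, h_sem], -1) @ gate_w)))
    return g * h_spa + (1.0 - g) * h_sem

rng = np.random.default_rng(0)
n, d = 5, 8                                  # 5 visual objects, 8-dim features
feats = rng.standard_normal((n, d))
out = mgrf_layer(feats,
                 rng.standard_normal((n, n)),    # spatial relation bias
                 rng.standard_normal((n, n)),    # semantic relation bias
                 rng.standard_normal((2 * d, 1)))
print(out.shape)  # (5, 8): one fused feature per object
```

In a trained model the relation matrices would come from geometry and pre-trained semantic relation embeddings rather than random noise, and the gate weights would be learned end-to-end.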
Pages: 70-79
Page count: 10