Interpretable Visual Question Answering by Reasoning on Dependency Trees

Times Cited: 35
Authors
Cao, Qingxing [1 ]
Liang, Xiaodan [1 ]
Li, Bailin [2 ,3 ]
Lin, Liang [2 ,3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou 510275, Guangdong, Peoples R China
[2] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou, Peoples R China
[3] Minist Educ, Engn Res Ctr Adv Comp Engn Software, Guangzhou 510275, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Visualization; Layout; Logic gates; Task analysis; Knowledge discovery; Image coding; Visual question answering; image and language parsing; deep reasoning; attention model;
DOI
10.1109/TPAMI.2019.2943456
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Collaborative reasoning for understanding image-question pairs is a very critical but underexplored topic in interpretable visual question answering systems. Although very recent studies have attempted to use explicit compositional processes to assemble multiple subtasks embedded in questions, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, which leads to either heavy workloads or poor performance on compositional reasoning. In this paper, to better align image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; thus, our model is called a parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module that exploits the local visual evidence of each word parsed from the question, ii) a gated residual composition module that composes the previously mined evidence, and iii) a parse-tree-guided propagation module that passes the mined evidence along the parse tree. Thus, PTGRN is capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning. Experiments on relational datasets demonstrate the superiority of PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.
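Illustrative note: the abstract describes three collaborative modules operating over a dependency parse of the question. The minimal PyTorch-style sketch below shows one way such parse-tree-guided reasoning could be wired together; all class names, dimensions, the mean-pooling of child evidence, and the bottom-up traversal are assumptions made here for illustration and are not the authors' implementation.

import torch
import torch.nn as nn

class WordGuidedAttention(nn.Module):
    # Soft attention over region features conditioned on a single word embedding.
    def __init__(self, vis_dim, word_dim, hid_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(vis_dim + word_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 1))

    def forward(self, regions, word):              # regions: (R, Dv), word: (Dw,)
        w = word.unsqueeze(0).expand(regions.size(0), -1)
        alpha = torch.softmax(self.score(torch.cat([regions, w], dim=-1)), dim=0)
        return (alpha * regions).sum(dim=0)        # (Dv,) attended visual evidence

class GatedResidualComposition(nn.Module):
    # Gated residual fusion of a node's own evidence with pooled child evidence.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, own, children):              # both: (Dv,)
        x = torch.cat([own, children], dim=-1)
        g = torch.sigmoid(self.gate(x))
        return own + g * torch.tanh(self.update(x))

def propagate(tree, regions, words, attend, compose):
    # Bottom-up pass over a dependency tree given as {node_id: [child_ids]};
    # evidence is mined per word, pooled over children, and composed at the parent.
    def visit(node):
        own = attend(regions, words[node])
        kids = tree.get(node, [])
        if not kids:
            return own
        pooled = torch.stack([visit(c) for c in kids]).mean(dim=0)
        return compose(own, pooled)
    root = next(iter(tree))                        # assume the first key is the root
    return visit(root)

if __name__ == "__main__":
    regions = torch.randn(36, 2048)                # e.g., 36 detected region features
    words = {0: torch.randn(300), 1: torch.randn(300), 2: torch.randn(300)}
    tree = {0: [1, 2], 1: [], 2: []}               # root word with two dependents
    evidence = propagate(tree, regions, words,
                         WordGuidedAttention(2048, 300),
                         GatedResidualComposition(2048))
    print(evidence.shape)                          # torch.Size([2048])

In such a sketch the evidence vector recovered at the root would feed an answer classifier; the actual PTGRN module definitions, gating functions, and propagation rules are given in the paper itself.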
Pages: 887-901
Number of Pages: 15