Interpretable Visual Question Answering by Reasoning on Dependency Trees

Cited by: 35
Authors
Cao, Qingxing [1 ]
Liang, Xiaodan [1 ]
Li, Bailin [2 ,3 ]
Lin, Liang [2 ,3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou 510275, Guangdong, Peoples R China
[2] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou, Peoples R China
[3] Minist Educ, Engn Res Ctr Adv Comp Engn Software, Guangzhou 510275, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Visualization; Layout; Logic gates; Task analysis; Knowledge discovery; Image coding; Visual question answering; image and language parsing; deep reasoning; attention model;
DOI
10.1109/TPAMI.2019.2943456
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Collaborative reasoning for understanding image-question pairs is a very critical but underexplored topic in interpretable visual question answering systems. Although very recent studies have attempted to use explicit compositional processes to assemble multiple subtasks embedded in questions, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, which leads to either heavy workloads or poor performance on compositional reasoning. In this paper, to better align image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; thus, our model is called a parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module that exploits the local visual evidence of each word parsed from the question, ii) a gated residual composition module that composes the previously mined evidence, and iii) a parse-tree-guided propagation module that passes the mined evidence along the parse tree. Thus, PTGRN is capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning. Experiments on relational datasets demonstrate the superiority of PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.
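The three collaborative modules described in the abstract (per-word visual attention, gated residual composition of child evidence, and bottom-up propagation along the dependency tree) can be sketched as follows. This is a minimal illustrative sketch only, not the authors' implementation: the `Node`, `attend`, and `compose` names are hypothetical, the attention is a plain dot-product softmax, and the gate is a simple scalar sigmoid standing in for the paper's learned gating.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


class Node:
    """A node of the dependency tree parsed from the question (hypothetical structure)."""
    def __init__(self, word_vec, children=()):
        self.word_vec = word_vec          # embedding of this word
        self.children = list(children)    # dependent words in the parse tree


def attend(node, regions):
    # Attention module: weight image-region features by similarity to the word.
    scores = regions @ node.word_vec      # one score per region
    weights = softmax(scores)
    return weights @ regions              # attended visual evidence for this word


def compose(node, regions):
    # Parse-tree-guided propagation: recurse over children first (bottom-up).
    child_feats = [compose(c, regions) for c in node.children]
    local = attend(node, regions)
    if not child_feats:
        return local
    merged = np.mean(child_feats, axis=0)
    # Gated residual composition: a sigmoid gate decides how much
    # previously mined child evidence to mix into the local evidence.
    gate = 1.0 / (1.0 + np.exp(-(local @ merged)))
    return local + gate * merged


d = 8
regions = rng.standard_normal((5, d))     # 5 image-region feature vectors
leaf1 = Node(rng.standard_normal(d))
leaf2 = Node(rng.standard_normal(d))
root = Node(rng.standard_normal(d), [leaf1, leaf2])

# Evidence gathered at the root after reasoning over the whole tree; shape (8,).
answer_feature = compose(root, regions)
```

Because composition is driven by the parse tree rather than by annotated or handcrafted module layouts, each intermediate `attend` output can be visualized per word, which is the source of the interpretability the abstract claims.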
Pages: 887-901
Page count: 15