Learning to Assemble Neural Module Tree Networks for Visual Grounding

Cited by: 203
Authors
Liu, Daqing [1];
Zhang, Hanwang [2];
Wu, Feng [1];
Zha, Zheng-Jun [1]
Affiliations
[1] University of Science and Technology of China, Hefei, China
[2] Nanyang Technological University, Singapore
Source
2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019
Funding
National Key R&D Program of China; National Natural Science Foundation of China
DOI
10.1109/ICCV.2019.00477
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTREE) that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated bottom-up as needed. NMTREE disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive, easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which accounts for the discrete nature of module assembly. Overall, the proposed NMTREE consistently outperforms state-of-the-art methods on several benchmarks. Qualitative results illustrate the explainable grounding score calculation in detail.
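The two mechanisms named in the abstract lend themselves to a short illustration: per-node neural modules score image regions and those scores are summed bottom-up along the parse tree, while the discrete choice of which module each node uses is trained through the straight-through Gumbel-Softmax. The PyTorch sketch below shows both ideas on a toy tree; it is a minimal sketch under assumed shapes and module definitions (TreeNode, ToyNMTree, and the bilinear scoring modules are all hypothetical stand-ins), not the paper's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeNode:
    """One node of a (toy) dependency parse tree."""
    def __init__(self, word_feat, children=()):
        self.word_feat = word_feat      # linguistic feature of this word
        self.children = list(children)  # dependents in the parse tree

class ToyNMTree(nn.Module):
    """Illustrative node-module tree; NOT the authors' architecture."""
    def __init__(self, word_dim, region_dim, n_modules=2):
        super().__init__()
        # Each candidate "module" scores image regions from a word feature.
        self.score_modules = nn.ModuleList(
            [nn.Bilinear(word_dim, region_dim, 1) for _ in range(n_modules)]
        )
        # Logits for the discrete module-assembly choice at each node.
        self.assemble = nn.Linear(word_dim, n_modules)

    def score_node(self, node, regions, tau=1.0):
        # regions: (R, region_dim) features of R candidate image regions.
        w = node.word_feat  # (word_dim,)
        # Straight-through Gumbel-Softmax: hard one-hot in the forward pass,
        # soft (differentiable) surrogate gradients in the backward pass.
        choice = F.gumbel_softmax(self.assemble(w), tau=tau, hard=True)
        w_rows = w.unsqueeze(0).expand(regions.size(0), -1)  # (R, word_dim)
        per_module = torch.stack(
            [m(w_rows, regions).squeeze(-1) for m in self.score_modules]
        )                                                    # (n_modules, R)
        score = (choice.unsqueeze(-1) * per_module).sum(0)   # (R,)
        # Bottom-up accumulation: a node adds its children's scores.
        for child in node.children:
            score = score + self.score_node(child, regions, tau)
        return score

# Toy usage: ground a 3-word parse tree against 5 candidate regions.
torch.manual_seed(0)
tree = TreeNode(torch.randn(16),
                children=[TreeNode(torch.randn(16)), TreeNode(torch.randn(16))])
model = ToyNMTree(word_dim=16, region_dim=32)
scores = model.score_node(tree, torch.randn(5, 32))
print("grounded region:", int(scores.argmax()))

Because hard=True returns a one-hot sample while gradients flow through the soft relaxation, the module-assembly decision stays discrete at inference time yet remains trainable end-to-end, which is what lets the assembly be learned jointly with the modules despite parsing errors.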
Pages: 4672-4681
Page count: 10