Learning to Assemble Neural Module Tree Networks for Visual Grounding

Cited by: 203
Authors
Liu, Daqing [1];
Zhang, Hanwang [2];
Wu, Feng [1];
Zha, Zheng-Jun [1]
Affiliations
[1] University of Science and Technology of China, Hefei, China
[2] Nanyang Technological University, Singapore
Source
2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019
Funding
National Key R&D Program of China; National Natural Science Foundation of China
DOI
10.1109/ICCV.2019.00477
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTREE) that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated bottom-up as needed. NMTREE disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive, easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which accounts for the discrete nature of module assembly. Overall, the proposed NMTREE consistently outperforms state-of-the-art methods on several benchmarks. Qualitative results illustrate the explainable grounding score calculation in detail.
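The two mechanisms named in the abstract lend themselves to a short illustration: per-node neural modules score image regions and those scores are summed bottom-up along the parse tree, while the discrete choice of which module each node uses is trained through the straight-through Gumbel-Softmax. The PyTorch sketch below shows both ideas on a toy tree; it is a minimal sketch under assumed shapes and module definitions (TreeNode, ToyNMTree, and the bilinear scoring modules are all hypothetical stand-ins), not the paper's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeNode:
    """One node of a (toy) dependency parse tree."""
    def __init__(self, word_feat, children=()):
        self.word_feat = word_feat      # linguistic feature of this word
        self.children = list(children)  # dependents in the parse tree

class ToyNMTree(nn.Module):
    """Illustrative node-module tree; NOT the authors' architecture."""
    def __init__(self, word_dim, region_dim, n_modules=2):
        super().__init__()
        # Each candidate "module" scores image regions from a word feature.
        self.score_modules = nn.ModuleList(
            [nn.Bilinear(word_dim, region_dim, 1) for _ in range(n_modules)]
        )
        # Logits for the discrete module-assembly choice at each node.
        self.assemble = nn.Linear(word_dim, n_modules)

    def score_node(self, node, regions, tau=1.0):
        # regions: (R, region_dim) features of R candidate image regions.
        w = node.word_feat  # (word_dim,)
        # Straight-through Gumbel-Softmax: hard one-hot in the forward pass,
        # soft (differentiable) surrogate gradients in the backward pass.
        choice = F.gumbel_softmax(self.assemble(w), tau=tau, hard=True)
        w_rows = w.unsqueeze(0).expand(regions.size(0), -1)  # (R, word_dim)
        per_module = torch.stack(
            [m(w_rows, regions).squeeze(-1) for m in self.score_modules]
        )                                                    # (n_modules, R)
        score = (choice.unsqueeze(-1) * per_module).sum(0)   # (R,)
        # Bottom-up accumulation: a node adds its children's scores.
        for child in node.children:
            score = score + self.score_node(child, regions, tau)
        return score

# Toy usage: ground a 3-word parse tree against 5 candidate regions.
torch.manual_seed(0)
tree = TreeNode(torch.randn(16),
                children=[TreeNode(torch.randn(16)), TreeNode(torch.randn(16))])
model = ToyNMTree(word_dim=16, region_dim=32)
scores = model.score_node(tree, torch.randn(5, 32))
print("grounded region:", int(scores.argmax()))

Because hard=True returns a one-hot sample while gradients flow through the soft relaxation, the module-assembly decision stays discrete at inference time yet remains trainable end-to-end, which is what lets the assembly be learned jointly with the modules despite parsing errors.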
Pages: 4672-4681
Page count: 10