Modeling Relationships in Referential Expressions with Compositional Modular Networks

Cited by: 218
Authors
Hu, Ronghang [1]
Rohrbach, Marcus [1]
Andreas, Jacob [1]
Darrell, Trevor [1]
Saenko, Kate [2]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Boston Univ, Boston, MA 02215 USA
Source
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) | 2017
DOI
10.1109/CVPR.2017.470
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a black cat entity and its relationship with another table entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper, we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
Pages: 4418-4427
Number of pages: 10