Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15
Authors
Jing, Chenchen [1]
Wu, Yuwei [1]
Pei, Mingtao [1]
Hu, Yao [2]
Jia, Yunde [1]
Wu, Qi [3]
Affiliations
[1] Beijing Institute of Technology, Beijing, People's Republic of China
[2] Alibaba Youku Cognitive and Intelligent Lab, Beijing, People's Republic of China
[3] University of Adelaide, Adelaide, SA, Australia
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Keywords
Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph; LANGUAGE;
DOI
10.1145/3394171.3413902
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. The two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. To reduce the discrepancy between them, we learn unified contextual node representations of the two graphs with a cross-modal graph convolutional network. Because the learned node representations characterize both node information and structure information, the graph matching can be relaxed to a linear assignment problem. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
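To illustrate the linear assignment step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes node embeddings have already been produced (here they are hand-written toy vectors rather than outputs of a cross-modal graph convolutional network), uses cosine similarity as the matching score, and solves the assignment exactly by brute-force enumeration, which is only practical for very small graphs. The names `cosine` and `match_nodes` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_nodes(lang_nodes, vis_nodes):
    """Assign each language node to a distinct visual node.

    Maximizes the total pairwise similarity by enumerating every
    injective assignment (assumption: the graphs are tiny, and there
    are at least as many visual nodes as language nodes).
    Returns (assignment, score), where assignment[i] is the index of
    the visual node matched to language node i.
    """
    if len(lang_nodes) > len(vis_nodes):
        raise ValueError("need at least as many visual nodes as language nodes")
    sim = [[cosine(l, v) for v in vis_nodes] for l in lang_nodes]
    best_score, best = float("-inf"), ()
    for perm in permutations(range(len(vis_nodes)), len(lang_nodes)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best = score, perm
    return list(best), best_score


# Toy embeddings (hypothetical values): two phrase nodes, three region nodes.
lang = [[1.0, 0.0], [0.0, 1.0]]
vis = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]
assignment, score = match_nodes(lang, vis)
print(assignment)  # language node 0 -> region 1, language node 1 -> region 0
```

In practice a polynomial-time solver such as the Hungarian algorithm would replace the enumeration. The permutation loss and cycle-consistency loss described in the abstract concern how the embeddings are trained and are not modeled here.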
Pages: 4041-4050
Page count: 10