Visual-Semantic Graph Matching for Visual Grounding

Cited by: 15

Authors:

Jing, Chenchen [1]

Wu, Yuwei [1]

Pei, Mingtao [1]

Hu, Yao [2]

Jia, Yunde [1]

Wu, Qi [3]

Affiliations:

[1] Beijing Inst Technol, Beijing, Peoples R China

[2] Alibaba Youku Cognit & Intelligent Lab, Beijing, Peoples R China

[3] Univ Adelaide, Adelaide, SA, Australia

Source:

MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020

Keywords:

Visual Grounding; Graph Matching; Visual Scene Graph; Language Scene Graph; LANGUAGE;

DOI:

10.1145/3394171.3413902

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. Graph matching is thus relaxed to a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
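
The relaxation to a linear assignment problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy node embeddings, the cosine similarity, and the function names `cosine`, `similarity_matrix`, and `best_assignment` are all assumptions made for illustration, and the exhaustive search stands in for whatever solver or learned loss the paper actually uses. The sketch assumes non-zero embeddings and no more language nodes than visual nodes.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(lang_nodes, vis_nodes):
    """Rows index language-graph nodes, columns index visual-graph nodes."""
    return [[cosine(l, v) for v in vis_nodes] for l in lang_nodes]


def best_assignment(sim):
    """Brute-force linear assignment: give each language node a distinct
    visual node so that the summed similarity is maximal.
    Only practical for very small graphs; used here for clarity."""
    n_lang, n_vis = len(sim), len(sim[0])
    best_cols, best_score = None, -math.inf
    for cols in permutations(range(n_vis), n_lang):
        score = sum(sim[i][c] for i, c in enumerate(cols))
        if score > best_score:
            best_cols, best_score = list(cols), score
    return best_cols, best_score


# Hypothetical 2-D node embeddings, standing in for learned representations.
lang_nodes = [[1.0, 0.0], [0.0, 1.0]]
vis_nodes = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]

assignment, score = best_assignment(similarity_matrix(lang_nodes, vis_nodes))
print(assignment)  # assignment[i] is the visual node matched to language node i
```

Because each node embedding is assumed to already encode its graph context, matching reduces to scoring node pairs independently and choosing a one-to-one assignment, which is what makes the problem linear rather than quadratic in the assignment variables.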

Pages: 4041-4050

Page count: 10
