Referring Expression Comprehension with Semantic Visual Relationship and Word Mapping

Cited by: 11
Authors
Zhang, Chao [1]
Li, Weiming [1]
Ouyang, Wanli [2]
Wang, Qiang [1]
Kim, Woo-Shik [3]
Hong, Sunghoon [3]
Affiliations
[1] Samsung Res China, Beijing, Peoples R China
[2] Univ Sydney, Sydney, NSW, Australia
[3] Samsung Adv Inst Technol, Suwon, South Korea
Source
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019
Keywords
referring expression comprehension; semantic visual relationship recognition; word2vec; word mapping;
DOI
10.1145/3343031.3351063
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Referring expression comprehension, which locates the object instance described by a natural language expression, has gained increasing interest in recent years. This paper aims to improve the task in two respects: visual feature extraction and language feature extraction. For visual feature extraction, we observe that most previous methods use only relative spatial information to model the visual relationship between object pairs and discard the rich semantic relationships between objects. This makes visual-language matching difficult when the expression relies on a semantic relationship to distinguish the referred object from other objects in the image. In this work, we propose a Semantic Visual Relationship Module (SVRM) to exploit this important information. For language feature extraction, a major problem comes from the long-tail distribution of words in the expressions. Since more than half of the words appear fewer than 20 times in the public datasets, deep models such as LSTMs tend to fail to learn accurate representations for these words. To solve this problem, we propose a word2vec-based word mapping method that maps these low-frequency words to high-frequency words with similar meanings. Experiments show that the proposed method outperforms existing state-of-the-art methods on three referring expression comprehension datasets.
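The word mapping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_word_mapping`, `map_expression`), the toy word counts, and the 3-dimensional vectors standing in for trained word2vec embeddings are all hypothetical, and choosing the nearest frequent word by cosine similarity is an assumption about how "similar meaning" might be measured. The threshold of 20 occurrences follows the figure quoted in the abstract.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_word_mapping(counts, embeddings, min_count=20):
    """Map each rare word to its most similar frequent word.

    counts:     word -> number of occurrences in the training expressions
    embeddings: word -> vector (stand-in for word2vec vectors; assumption)
    min_count:  words seen fewer than this many times are treated as rare
    """
    frequent = [w for w, c in counts.items()
                if c >= min_count and w in embeddings]
    mapping = {}
    for word, c in counts.items():
        # Skip frequent words and words without an embedding.
        if c >= min_count or word not in embeddings or not frequent:
            continue
        mapping[word] = max(
            frequent,
            key=lambda f: cosine(embeddings[word], embeddings[f]),
        )
    return mapping


def map_expression(tokens, mapping):
    """Replace rare tokens with their frequent counterparts."""
    return [mapping.get(t, t) for t in tokens]


# Toy data (hypothetical): counts and hand-made 3-d "embeddings".
counts = Counter({"man": 500, "dog": 300, "sofa": 120,
                  "gentleman": 3, "puppy": 7, "couch": 15})
embeddings = {
    "man": [0.9, 0.1, 0.0],
    "gentleman": [0.85, 0.15, 0.05],
    "dog": [0.1, 0.9, 0.0],
    "puppy": [0.15, 0.85, 0.1],
    "sofa": [0.0, 0.1, 0.9],
    "couch": [0.05, 0.1, 0.95],
}

mapping = build_word_mapping(counts, embeddings)
print(map_expression("gentleman on the couch".split(), mapping))
# prints ['man', 'on', 'the', 'sofa']
```

In this sketch the mapping is computed once from corpus statistics and applied to each expression before it reaches the language encoder, so the encoder only sees words with enough training examples.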
Pages: 1258-1266
Number of pages: 9