Phrase Localization Without Paired Training Examples

被引:45
作者
Wang, Josiah [1 ]
Specia, Lucia [1 ]
机构
[1] Imperial Coll London, London, England
来源
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019年
基金
欧盟地平线“2020”;
关键词
D O I
10.1109/ICCV.2019.00476
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Localizing phrases in images is an important part of image understanding and can be useful in many applications that require mappings between textual and visual information. Existing work attempts to learn these mappings from examples of phrase-image region correspondences (strong supervision) or from phrase-image pairs (weak supervision). We postulate that such paired annotations are unnecessary, and propose the first method for the phrase localization problem where neither training procedure nor paired, task-specific data is required. Our method is simple but effective: we use off-the-shelf approaches to detect objects, scenes and colours in images, and explore different approaches to measure semantic similarity between the categories of detected visual elements and words in phrases. Experiments on two well-known phrase localization datasets show that this approach surpasses all weakly supervised methods by a large margin and performs very competitively to strongly supervised methods, and can thus be considered a strong baseline to the task. The non-paired nature of our method makes it applicable to any domain and where no paired phrase localization annotation is available.
引用
收藏
页码:4662 / 4671
页数:10
相关论文
共 40 条
[1]  
[Anonymous], 2015, Arxiv.Org, DOI DOI 10.3389/FPSYG.2013.00124
[2]  
Berg T., 2014, EMNLP, P787
[3]   Knowledge Aided Consistency for Weakly Supervised Phrase Grounding [J].
Chen, Kan ;
Gao, Jiyang ;
Nevatia, Ram .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4042-4050
[4]   MSRC: multimodal spatial regression with semantic context for phrase grounding [J].
Chen, Kan ;
Kovvuri, Rama ;
Gao, Jiyang ;
Nevatia, Ram .
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2018, 7 (01) :17-28
[5]   Query-guided Regression Network with Context Policy for Phrase Grounding [J].
Chen, Kan ;
Kovvuri, Rama ;
Nevatia, Ram .
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :824-832
[6]  
Chen Xinlei, 2015, CoRR
[7]   The Pascal Visual Object Classes (VOC) Challenge [J].
Everingham, Mark ;
Van Gool, Luc ;
Williams, Christopher K. I. ;
Winn, John ;
Zisserman, Andrew .
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2010, 88 (02) :303-338
[8]  
Fellbaum C, 1998, LANG SPEECH & COMMUN, P1
[9]  
Girshick R., 2014, IEEE COMP SOC C COMP, DOI [10.1109/CVPR.2014.81, DOI 10.1109/CVPR.2014.81]
[10]   Fast R-CNN [J].
Girshick, Ross .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :1440-1448