Self-supervised learning of visual features through embedding images into text topic spaces

Cited by: 55
Authors
Gomez, Lluis [1 ]
Patel, Yash [2 ]
Rusinol, Marcal [1 ]
Karatzas, Dimosthenis [1 ]
Jawahar, C. V. [2 ]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
[2] IIIT Hyderabad, KCIS, CVIT, Hyderabad, Andhra Pradesh, India
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
DOI: 10.1109/CVPR.2017.218
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
End-to-end training of current deep architectures from scratch on new computer vision problems would require ImageNet-scale datasets, which are not always available. In this paper we present a method that takes advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.
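The abstract only sketches the training signal, so the following minimal Python sketch illustrates the idea (it is not the authors' code): fit an LDA topic model on the text side of an image-text corpus with gensim, then train a CNN to predict each image's document topic distribution via a soft cross-entropy. The toy documents, the tiny CNN, and the random image tensors are all placeholder assumptions; the paper mines a much larger multi-modal corpus with a standard CNN architecture.

# Sketch: LDA topic distributions as self-supervised targets for a CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F
from gensim import corpora
from gensim.models import LdaModel

# 1) Topic modeling on the text corpus (toy stand-in documents).
docs = [["deep", "learning", "vision"], ["soccer", "goal", "match"]]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
num_topics = 2
lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)

def topic_target(bow):
    """Dense LDA topic distribution used as the soft training target."""
    dist = torch.zeros(num_topics)
    for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[k] = p
    return dist

# 2) CNN that predicts the topic distribution of the source document.
class TopicCNN(nn.Module):
    def __init__(self, num_topics):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.head = nn.Linear(16, num_topics)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # topic logits

model = TopicCNN(num_topics)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative step: random tensors stand in for the real illustrations,
# paired with the topic distributions of the documents they appear in.
images = torch.randn(2, 3, 64, 64)
targets = torch.stack([topic_target(b) for b in bows])

# Soft cross-entropy: match predicted log-probabilities to the LDA targets.
opt.zero_grad()
log_probs = F.log_softmax(model(images), dim=1)
loss = -(targets * log_probs).sum(dim=1).mean()
loss.backward()
opt.step()

Once trained this way, the convolutional features can be reused for downstream tasks (classification, detection, retrieval) without any human-annotated labels.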
Pages: 2017-2026
Number of pages: 10