Self-supervised learning of visual features through embedding images into text topic spaces

Cited by: 55
Authors
Gomez, Lluis [1 ]
Patel, Yash [2 ]
Rusinol, Marcal [1 ]
Karatzas, Dimosthenis [1 ]
Jawahar, C. V. [2 ]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
[2] IIIT Hyderabad, KCIS, CVIT, Hyderabad, Andhra Pradesh, India
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
DOI: 10.1109/CVPR.2017.218
CLC number: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
End-to-end training of current deep architectures from scratch on new computer vision problems would require ImageNet-scale datasets, which are not always available. In this paper we present a method that takes advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.
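The abstract only sketches the training signal, so the following minimal Python sketch illustrates the idea (it is not the authors' code): fit an LDA topic model on the text side of an image-text corpus with gensim, then train a CNN to predict each image's document topic distribution via a soft cross-entropy. The toy documents, the tiny CNN, and the random image tensors are all placeholder assumptions; the paper mines a much larger multi-modal corpus with a standard CNN architecture.

# Sketch: LDA topic distributions as self-supervised targets for a CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F
from gensim import corpora
from gensim.models import LdaModel

# 1) Topic modeling on the text corpus (toy stand-in documents).
docs = [["deep", "learning", "vision"], ["soccer", "goal", "match"]]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
num_topics = 2
lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)

def topic_target(bow):
    """Dense LDA topic distribution used as the soft training target."""
    dist = torch.zeros(num_topics)
    for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[k] = p
    return dist

# 2) CNN that predicts the topic distribution of the source document.
class TopicCNN(nn.Module):
    def __init__(self, num_topics):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.head = nn.Linear(16, num_topics)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # topic logits

model = TopicCNN(num_topics)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative step: random tensors stand in for the real illustrations,
# paired with the topic distributions of the documents they appear in.
images = torch.randn(2, 3, 64, 64)
targets = torch.stack([topic_target(b) for b in bows])

# Soft cross-entropy: match predicted log-probabilities to the LDA targets.
opt.zero_grad()
log_probs = F.log_softmax(model(images), dim=1)
loss = -(targets * log_probs).sum(dim=1).mean()
loss.backward()
opt.step()

Once trained this way, the convolutional features can be reused for downstream tasks (classification, detection, retrieval) without any human-annotated labels.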
Pages: 2017-2026
Number of pages: 10