Picture it in your mind: generating high level visual representations from textual descriptions

Cited by: 18
Authors
Carrara, Fabio [1]
Esuli, Andrea [1]
Fagni, Tiziano [1]
Falchi, Fabrizio [1]
Fernandez, Alejandro Moreo [1]
Affiliations
[1] CNR, ISTI, Via G Moruzzi 1, I-56124 Pisa, Italy
Source
INFORMATION RETRIEVAL JOURNAL | 2018 / Vol. 21 / Issue 2-3
Keywords
Image retrieval; Cross-media retrieval; Text representation;
DOI
10.1007/s10791-017-9318-6
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6-fc7 layers of an AlexNet trained on ILSVRC12 and Places databases. The TEXT2VIS models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation for the textual descriptors, (2) a deep recurrent network that is sensitive to word order, and (3) a wide and deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the models we propose with other search strategies, including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
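The search scheme described in the abstract (translate the text into the visual feature space, then run a similarity search over precomputed image features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy captions, the random 8-dimensional vectors standing in for real CNN features such as pool5 or fc6-fc7 activations, the whitespace tokenizer, and the ridge regression that replaces the neural regressor. It is not the TEXT2VIS models themselves, and the function names are hypothetical.

```python
import numpy as np

def bag_of_words(text, vocab):
    """Count vector for `text` over a fixed vocabulary (toy whitespace tokenizer)."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

def fit_linear_regressor(X, Y, l2=1e-3):
    """Ridge regression: W minimizing ||XW - Y||^2 + l2 * ||W||^2.

    Assumption: a linear map stands in for the neural regressor network.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

def search(query_vec, image_feats, k=1):
    """Indices of the k images with the highest cosine similarity to query_vec."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    F = image_feats / (np.linalg.norm(image_feats, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(F @ q))[:k]

# Toy training pairs: one caption per image (hypothetical data).
captions = ["a dog on grass", "a cat on a sofa", "a red car on a road"]
vocab = {w: i for i, w in enumerate(sorted({t for c in captions for t in c.split()}))}

# Placeholder for visual features extracted once from the image collection.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(len(captions), 8))

# Learn the text-to-visual translation; the image index is never touched again.
X = np.stack([bag_of_words(c, vocab) for c in captions])
W = fit_linear_regressor(X, image_feats)

# Query time: map the text into the visual space, then search by similarity.
query_visual = bag_of_words("a dog on grass", vocab) @ W
top = search(query_visual, image_feats, k=1)
```

Because only `W` changes when the translation model is retrained, `image_feats` can be computed and indexed once, which is the advantage the abstract attributes to searching in the visual feature space.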
Pages: 208-229
Page count: 22