Cross-modal retrieval extends the ability of search engines to deal with massive amounts of cross-modal data. The goal of image-text cross-modal retrieval is to search for images (texts) using text (image) queries by directly computing the similarities between images and texts. Many existing methods rely on low-level visual and textual features for cross-modal retrieval, ignoring the intrinsic characteristics of the raw data in different modalities. In this paper, a novel model based on modality-specific feature learning is proposed. Considering the characteristics of the different modalities, the model uses two types of convolutional neural networks to map the raw image and text data to latent space representations, respectively. In particular, the convolution-based network used for texts incorporates word embedding learning, which has proved effective for extracting meaningful textual features in text classification. In the latent space, the mapped features of images and texts form relevant and irrelevant image-text pairs, which are used by a one-vs-more learning scheme. This learning scheme achieves ranking functionality by contrasting one relevant pair with multiple irrelevant pairs. The standard backpropagation technique is employed to update the parameters of the two convolutional networks. Extensive cross-modal retrieval experiments are carried out on three challenging datasets consisting of image-document pairs or image-query clickthrough data from a search engine, and the results demonstrate that the proposed model is considerably more effective than existing methods.
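
The abstract describes a one-vs-more learning scheme that contrasts one relevant image-text pair with multiple irrelevant pairs to obtain ranking behavior. The snippet below is a minimal sketch of how such a loss could be written, assuming a margin-based hinge formulation over cosine similarities in the shared latent space; the function name `one_vs_more_loss`, the margin value, and the toy dimensions are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a one-vs-more ranking loss; the margin-based hinge
# form is an assumption, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def one_vs_more_loss(img_feat, pos_txt_feat, neg_txt_feats, margin=0.2):
    """Contrast one relevant image-text pair with several irrelevant ones.

    img_feat:      (d,)   latent feature of the query image
    pos_txt_feat:  (d,)   latent feature of the relevant text
    neg_txt_feats: (k, d) latent features of k irrelevant texts
    """
    pos_sim = F.cosine_similarity(img_feat, pos_txt_feat, dim=0)
    neg_sims = F.cosine_similarity(img_feat.unsqueeze(0), neg_txt_feats, dim=1)
    # Hinge: each irrelevant pair should score at least `margin` below the relevant pair.
    return torch.clamp(margin - pos_sim + neg_sims, min=0).sum()


# Toy usage with random latent features (d=128, k=5 irrelevant texts); in the
# described model these features would come from the two convolutional networks.
img = torch.randn(128, requires_grad=True)
pos = torch.randn(128)
negs = torch.randn(5, 128)
loss = one_vs_more_loss(img, pos, negs)
loss.backward()  # standard backpropagation, as mentioned in the abstract
```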