Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning

Cited by: 26
Authors
Wang, Jian [1]
He, Yonghao [1]
Kang, Cuicui [1]
Xiang, Shiming [1]
Pan, Chunhong [1]
Affiliations
[1] Chinese Academy of Sciences, Beijing, People's Republic of China
Source
ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL | 2015
Keywords
Cross-modal Retrieval; Convolutional Neural Network; Feature Learning; Deep Neural Network
DOI
10.1145/2671188.2749341
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Cross-modal retrieval extends the ability of search engines to handle massive cross-modal data. The goal of image-text cross-modal retrieval is to search images (texts) with text (image) queries by computing the similarities between images and texts directly. Many existing methods rely on low-level visual and textual features for cross-modal retrieval, ignoring the characteristics of the raw data in different modalities. In this paper, a novel model based on modality-specific feature learning is proposed. Considering the characteristics of different modalities, the model uses two types of convolutional neural networks to map the raw data to latent space representations for images and texts, respectively. In particular, the convolution-based network used for texts involves word embedding learning, which has proved effective for extracting meaningful textual features in text classification. In the latent space, the mapped features of images and texts form relevant and irrelevant image-text pairs, which are used by a one-vs-more learning scheme. This scheme achieves ranking functionality by contrasting one relevant pair against multiple irrelevant pairs. The standard backpropagation technique is employed to update the parameters of the two convolutional networks. Extensive cross-modal retrieval experiments are carried out on three challenging datasets consisting of image-document pairs or image-query clickthrough data from a search engine, and the results demonstrate that the proposed model is substantially more effective than competing methods.
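To make the "one-vs-more" learning scheme concrete, the sketch below (Python/NumPy, not the authors' code) shows one way such a ranking objective can be written: one relevant image-text pair is contrasted against several irrelevant pairs in the shared latent space, and any irrelevant pair whose similarity comes within a margin of the relevant pair incurs a hinge penalty. The function names, the choice of cosine similarity, and the margin value are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def cosine_sim(a, b):
        # Cosine similarity between two feature vectors in the shared latent space.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def one_vs_more_loss(query, relevant, irrelevants, margin=0.2):
        # Hinge-style ranking loss: the relevant pair should score higher than
        # every irrelevant pair by at least `margin`.
        pos = cosine_sim(query, relevant)
        return sum(max(0.0, margin - pos + cosine_sim(query, neg)) for neg in irrelevants)

    # Toy usage with random latent features standing in for the CNN outputs.
    rng = np.random.default_rng(0)
    img = rng.standard_normal(128)                            # image feature
    txt_pos = img + 0.1 * rng.standard_normal(128)            # matching text feature
    txt_negs = [rng.standard_normal(128) for _ in range(5)]   # non-matching text features
    print(one_vs_more_loss(img, txt_pos, txt_negs))

In this formulation the gradient of the loss with respect to the image and text features is what backpropagation would push through the two modality-specific convolutional networks during training.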
Pages: 347-354
Number of pages: 8