Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Cited by: 14
Authors
Balaneshin-kordan, Saeid [1]
Kotov, Alexander [1]
Affiliations
[1] Wayne State Univ, Detroit, MI 48202 USA
Source
WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2018
Keywords
Multi-Modal IR; Cross-Modal IR; Deep Neural Networks
DOI
10.1145/3159652.3159735
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in deep learning and distributed representations of images and text have resulted in the emergence of several neural architectures for cross-modal retrieval tasks, such as searching collections of images in response to textual queries and assigning textual descriptions to images. However, the multi-modal retrieval scenario, in which a query can be either text or an image and the goal is to retrieve a textual fragment together with an image as a single atomic unit, has been significantly less studied. In this paper, we propose a gated neural architecture to project image and keyword queries as well as multi-modal retrieval units into the same low-dimensional embedding space and perform semantic matching in this space. The proposed architecture is trained to minimize a structured hinge loss and can be applied to both cross- and multi-modal retrieval. Experimental results for six different cross- and multi-modal retrieval tasks obtained on publicly available datasets indicate superior retrieval accuracy of the proposed architecture in comparison to state-of-the-art baselines.
Pages: 28-36
Page count: 9