Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Cited by: 14

Authors:

Balaneshin-kordan, Saeid [1]

Kotov, Alexander [1]

Affiliations:

[1] Wayne State Univ, Detroit, MI 48202 USA

Source:

WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2018

Keywords:

Multi-Modal IR; Cross-Modal IR; Deep Neural Networks;

DOI:

10.1145/3159652.3159735

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent advances in deep learning and distributed representations of images and text have resulted in the emergence of several neural architectures for cross-modal retrieval tasks, such as searching collections of images in response to textual queries and assigning textual descriptions to images. However, the multi-modal retrieval scenario, in which a query can be either a text or an image and the goal is to retrieve both a textual fragment and an image, which should be considered as an atomic unit, has been significantly less studied. In this paper, we propose a gated neural architecture to project image and keyword queries as well as multi-modal retrieval units into the same low-dimensional embedding space and perform semantic matching in this space. The proposed architecture is trained to minimize structured hinge loss and can be applied to both cross- and multi-modal retrieval. Experimental results for six different cross- and multi-modal retrieval tasks obtained on publicly available datasets indicate superior retrieval accuracy of the proposed architecture in comparison to the state-of-the-art baselines.
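As a rough illustration of the training objective named in the abstract, the sketch below computes a generic max-margin (hinge) ranking loss over cosine similarities in a shared embedding space. It is not the paper's formulation: this record does not give the exact loss, the gating mechanism, or the encoders, so the function names (`cosine`, `hinge_ranking_loss`), the margin value, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge_ranking_loss(query, positive, negatives, margin=0.2):
    """Sum over negatives of max(0, margin - s(q, pos) + s(q, neg)).

    `query`, `positive`, and each item of `negatives` are embeddings that
    are assumed to already live in the same joint space (hypothetical
    inputs; the margin of 0.2 is an arbitrary illustrative choice).
    """
    pos_score = cosine(query, positive)
    return sum(
        max(0.0, margin - pos_score + cosine(query, neg))
        for neg in negatives
    )


# Toy 2-D embeddings (made-up values, not data from the paper).
query = [1.0, 0.0]
relevant_unit = [1.0, 0.0]

# Negatives far from the query contribute nothing to the loss.
easy_loss = hinge_ranking_loss(query, relevant_unit, [[0.0, 1.0], [1.0, 1.0]])

# A negative that scores almost as high as the positive violates the margin.
hard_loss = hinge_ranking_loss(query, relevant_unit, [[1.0, 0.1]])
```

The loss is zero once every irrelevant unit scores at least `margin` below the relevant one, so only margin-violating negatives drive the embedding updates.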


Pages: 28-36

Page count: 9
