Learning to Embed Semantic Similarity for Joint Image-Text Retrieval

Cited by: 6
Authors
Malali, Noam [1]
Keller, Yosi [1]
Affiliations
[1] Bar Ilan Univ, Fac Engn, IL-5290002 Ramat Gan, Israel
Keywords
Text and image fusion; deep learning; joint embedding
DOI
10.1109/TPAMI.2021.3132163
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. To that end, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches.
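The two loss terms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the vectors are toy values, and the margin is passed in as a plain parameter because this record does not describe how the adaptive margin is refined during training.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge_triplet_loss(anchor, positive, negative, margin):
    """Hinge loss on L2 distances.

    The loss is zero once the negative sample is at least `margin`
    farther from the anchor than the positive sample.
    `margin` is an assumed fixed input here, not the adaptive rule
    from the paper.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


def center_loss(embeddings, center):
    """Mean squared L2 distance of embeddings to a shared concept center."""
    return sum(l2_distance(e, center) ** 2 for e in embeddings) / len(embeddings)


# Toy 2-D embeddings (hypothetical values, for illustration only).
image = [0.0, 0.0]
matching_caption = [3.0, 4.0]    # L2 distance 5 from the image
unrelated_caption = [6.0, 8.0]   # L2 distance 10 from the image

loss_small = hinge_triplet_loss(image, matching_caption, unrelated_caption, margin=2.0)
loss_large = hinge_triplet_loss(image, matching_caption, unrelated_caption, margin=7.0)
```

With the smaller margin the triplet is already satisfied and contributes no loss; with the larger margin the same triplet is penalized, which is why the choice of margin matters for this kind of objective.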
Pages: 10252-10260
Number of pages: 9
Related papers
42 in total
  • [1] AKAHO S., 2006, ARXIV
  • [2] Andrienko G., 2013, Introduction, P1
  • [3] Arandjelovic R, 2018, IEEE T PATTERN ANAL, V40, P1437, DOI [10.1109/TPAMI.2017.2711011, 10.1109/CVPR.2016.572]
  • [4] Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions
    Ba, Jimmy Lei
    Swersky, Kevin
    Fidler, Sanja
    Salakhutdinov, Ruslan
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4247 - 4255
  • [5] Correlational Neural Networks
    Chandar, Sarath
    Khapra, Mitesh M.
    Larochelle, Hugo
    Ravindran, Balaraman
    [J]. NEURAL COMPUTATION, 2016, 28 (02) : 257 - 285
  • [6] Chechik G, 2010, J MACH LEARN RES, V11, P1109
  • [7] Beyond triplet loss: a deep quadruplet network for person re-identification
    Chen, Weihua
    Chen, Xiaotang
    Zhang, Jianguo
    Huang, Kaiqi
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1320 - 1329
  • [8] De Brabandere B, 2017, Arxiv, DOI [arXiv:1708.02551, 10.48550/arXiv.1708.02551]
  • [9] Linking Image and Text with 2-Way Nets
    Eisenschtat, Aviv
    Wolf, Lior
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1855 - 1865
  • [10] Faghri F., 2018, PROC BRIT MACH VIS C