Learning hierarchical embedding space for image-text matching

Cited by: 0
Authors
Sun, Hao [1 ]
Qin, Xiaolin [1 ]
Liu, Xiaojing [1 ]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Information retrieval; cross-modal representation; hierarchical embedding; local alignment
DOI
10.3233/IDA-230214
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
There are currently two mainstream strategies for image-text matching. The first, termed joint embedding learning, models the semantic information of both image and sentence in a shared feature subspace, which makes semantic similarity easy to measure but captures only the global alignment relationship. To explore local semantic relationships more fully, the second, termed metric learning, learns a complex similarity function that directly outputs a score for each image-text pair; however, it suffers from a much heavier computational burden at the retrieval stage. In this paper, we propose a hierarchical joint embedding model that incorporates local semantic relationships into a joint embedding learning framework. The proposed method learns shared local and global embedding spaces simultaneously, and models the joint local embedding space with respect to specific local similarity labels that are easy to obtain from the lexical information of the corpus. Unlike metric-learning-based methods, we can precompute fixed representations of both images and sentences by concatenating the normalized local and global representations, which makes efficient retrieval feasible. Experiments on two publicly available datasets, Flickr30k and MS-COCO, show that the proposed model achieves competitive performance compared with existing joint embedding learning models.
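To make the retrieval-efficiency claim concrete, the following is a minimal sketch (not the paper's implementation) of the fusion step the abstract describes: local and global embeddings are L2-normalized and concatenated into one fixed vector per item, so a whole gallery can be scored with a single matrix product instead of a per-pair network forward pass. All names, dimensions, and the random toy data here are hypothetical.

```python
import numpy as np

def fused_embedding(local_emb, global_emb):
    """Concatenate L2-normalized local and global embeddings into one
    fixed representation (hypothetical helper illustrating the abstract)."""
    local_emb = local_emb / np.linalg.norm(local_emb, axis=-1, keepdims=True)
    global_emb = global_emb / np.linalg.norm(global_emb, axis=-1, keepdims=True)
    return np.concatenate([local_emb, global_emb], axis=-1)

rng = np.random.default_rng(0)

# Precompute fixed sentence representations once, offline (toy data).
sent_reps = fused_embedding(rng.normal(size=(1000, 256)),
                            rng.normal(size=(1000, 512)))

# At query time, one matrix product scores an image against all sentences;
# no pairwise similarity network is evaluated, unlike metric learning.
img_rep = fused_embedding(rng.normal(size=(1, 256)),
                          rng.normal(size=(1, 512)))
scores = img_rep @ sent_reps.T          # dot products of normalized parts
top5 = np.argsort(-scores[0])[:5]       # indices of best-matching sentences
print(top5)
```

Under this sketch, retrieval over N gallery items costs one N x d matrix product, which is the efficiency advantage over metric learning that the abstract points to.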
Pages: 647-665
Page count: 19