Image-Text Retrieval via Contrastive Learning with Auxiliary Generative Features and Support-set Regularization

Cited by: 4
Authors
Zhang, Lei [1,5]
Yang, Min [1]
Li, Chengming [2]
Xu, Ruifeng [3,4]
Affiliations
[1] Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
[2] School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China
[3] Harbin Institute of Technology, Shenzhen, China
[4] Peng Cheng Laboratory, Shenzhen, China
[5] University of Chinese Academy of Sciences, Beijing, China
Source
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), 2022
Funding
National Natural Science Foundation of China
Keywords
Cross-modal image-text retrieval; Contrastive learning; Support-set regularization; Generative features
DOI
10.1145/3477495.3531783
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
In this paper, we bridge the heterogeneity gap between modalities and improve image-text retrieval by exploiting auxiliary image-to-text and text-to-image generative features within a contrastive learning framework. Concretely, contrastive learning narrows the distance between aligned image-text pairs and widens the distance between unaligned pairs, from both inter- and intra-modality perspectives, with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term that further improves contrastive learning by constraining the distance between each image/text and the cross-modal support-set information drawn from the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (MIRFLICKR-25K, NUS-WIDE, and MS COCO). Experimental results show that our model significantly outperforms strong baselines for cross-modal image-text retrieval. For reproducibility, we release the code and data publicly at: https://github.com/Hambaobao/CRCGS.
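The abstract describes the training objective only at a high level. For illustration, the minimal PyTorch sketch below shows one plausible way to combine inter- and intra-modality contrastive terms with a support-set regularizer. It is not the authors' implementation (see the linked repository for that): the function names info_nce and composite_loss, the loss weights alpha and beta, the single-label batches, and the mean-embedding reading of the support set are all assumptions introduced here.

    import torch
    import torch.nn.functional as F

    def info_nce(anchors, candidates, temperature=0.07):
        # InfoNCE over a batch: the i-th anchor is aligned with the i-th
        # candidate; every other candidate in the batch acts as a negative.
        logits = anchors @ candidates.t() / temperature
        targets = torch.arange(anchors.size(0), device=anchors.device)
        return F.cross_entropy(logits, targets)

    def composite_loss(img, txt, img_gen, txt_gen, labels, alpha=0.5, beta=0.1):
        # img/txt: cross-modal retrieval features of real images and texts.
        # img_gen: features of images generated from texts (text-to-image);
        # txt_gen: features of texts generated from images (image-to-text).
        # labels: (B,) integer semantic categories (single-label assumption).
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        img_gen = F.normalize(img_gen, dim=-1)
        txt_gen = F.normalize(txt_gen, dim=-1)

        # Inter-modality contrast: pull aligned image-text pairs together and
        # push unaligned pairs apart, in both retrieval directions.
        inter = info_nce(img, txt) + info_nce(txt, img)

        # Intra-modality contrast via the auxiliary generative features: each
        # real feature should match its own generated counterpart.
        intra = info_nce(img, img_gen) + info_nce(txt, txt_gen)

        # Support-set regularization (one simple reading of the abstract):
        # pull each image toward the mean text embedding of its semantic
        # category within the batch, and symmetrically for texts.
        same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
        txt_support = F.normalize(same @ txt / same.sum(1, keepdim=True), dim=-1)
        img_support = F.normalize(same @ img / same.sum(1, keepdim=True), dim=-1)
        reg = ((1.0 - (img * txt_support).sum(-1)).mean()
               + (1.0 - (txt * img_support).sum(-1)).mean())

        return inter + alpha * intra + beta * reg

A batch of precomputed features and integer category labels is enough to exercise the sketch, e.g. composite_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256), torch.randint(0, 4, (8,))).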
Pages: 1938 - 1943 (6 pages)
References (21 in total)
  • [1] NUS-WIDE: A Real-World Web Image Database from National University of Singapore
    Chua, Tat-Seng
    Tang, Jinhui
    Hong, Richang
    Li, Haojie
    Luo, Zhiping
    Zheng, Yan-Tao
    [J]. PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON IMAGE AND VIDEO RETRIEVAL (CIVR '09), 2009. DOI: 10.1145/1646396.1646452
  • [2] Cross-modal Retrieval with Correspondence Autoencoder
    Feng, Fangxiang
    Wang, Xiaojie
    Li, Ruifan
    [J]. PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 7 - 16
  • [3] Generative Adversarial Nets
    Goodfellow, Ian J.
    Pouget-Abadie, Jean
    Mirza, Mehdi
    Xu, Bing
    Warde-Farley, David
    Ozair, Sherjil
    Courville, Aaron
    Bengio, Yoshua
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, : 2672 - 2680
  • [4] Cross-Modal Retrieval via Deep and Bidirectional Representation Learning
    He, Yonghao
    Xiang, Shiming
    Kang, Cuicui
    Wang, Jian
    Pan, Chunhong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (07) : 1363 - 1377
  • [5] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijia
    Zhao, Sicheng
    Han, Jungong
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [6] The MIR Flickr Retrieval Evaluation
    Huiskes, Mark J.
    Lew, Michael S.
    [J]. PROCEEDINGS OF THE 1ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA INFORMATION RETRIEVAL (MIR '08), 2008, : 39 - 43
  • [7] Deep Cross-Modal Hashing
    Jiang, Qing-Yuan
    Li, Wu-Jun
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3270 - 3278
  • [8] Kingma, D. P., 2014, Proceedings of the International Conference on Learning Representations (ICLR)
  • [9] Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
    Li, Chao
    Deng, Cheng
    Li, Ning
    Liu, Wei
    Gao, Xinbo
    Tao, Dacheng
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4242 - 4251
  • [10] Microsoft COCO: Common Objects in Context
    Lin, Tsung-Yi
    Maire, Michael
    Belongie, Serge
    Hays, James
    Perona, Pietro
    Ramanan, Deva
    Dollar, Piotr
    Zitnick, C. Lawrence
    [J]. COMPUTER VISION - ECCV 2014, PT V, 2014, 8693 : 740 - 755