Image-Text Retrieval via Contrastive Learning with Auxiliary Generative Features and Support-set Regularization

Cited by: 4
Authors
Zhang, Lei [1,5]
Yang, Min [1]
Li, Chengming [2]
Xu, Ruifeng [3,4]
Affiliations
[1] Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
[2] School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China
[3] Harbin Institute of Technology, Shenzhen, China
[4] Peng Cheng Laboratory, Shenzhen, China
[5] University of Chinese Academy of Sciences, Beijing, China
Source
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), 2022
Funding
National Natural Science Foundation of China
Keywords
Cross-modal image-text retrieval; Contrastive learning; Support-set regularization; Generative features
DOI
10.1145/3477495.3531783
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
In this paper, we bridge the heterogeneity gap between modalities and improve image-text retrieval by exploiting auxiliary image-to-text and text-to-image generative features within a contrastive learning framework. Concretely, contrastive learning narrows the distance between aligned image-text pairs and widens the distance between unaligned pairs, from both inter- and intra-modality perspectives, with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term that further improves contrastive learning by constraining the distance between each image/text and the cross-modal support-set information drawn from the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (MIRFLICKR-25K, NUS-WIDE, and MS COCO). Experimental results show that our model significantly outperforms strong baselines for cross-modal image-text retrieval. For reproducibility, we release the code and data publicly at: https://github.com/Hambaobao/CRCGS.
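The abstract describes the training objective only at a high level. For illustration, the minimal PyTorch sketch below shows one plausible way to combine inter- and intra-modality contrastive terms with a support-set regularizer. It is not the authors' implementation (see the linked repository for that): the function names info_nce and composite_loss, the loss weights alpha and beta, the single-label batches, and the mean-embedding reading of the support set are all assumptions introduced here.

    import torch
    import torch.nn.functional as F

    def info_nce(anchors, candidates, temperature=0.07):
        # InfoNCE over a batch: the i-th anchor is aligned with the i-th
        # candidate; every other candidate in the batch acts as a negative.
        logits = anchors @ candidates.t() / temperature
        targets = torch.arange(anchors.size(0), device=anchors.device)
        return F.cross_entropy(logits, targets)

    def composite_loss(img, txt, img_gen, txt_gen, labels, alpha=0.5, beta=0.1):
        # img/txt: cross-modal retrieval features of real images and texts.
        # img_gen: features of images generated from texts (text-to-image);
        # txt_gen: features of texts generated from images (image-to-text).
        # labels: (B,) integer semantic categories (single-label assumption).
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        img_gen = F.normalize(img_gen, dim=-1)
        txt_gen = F.normalize(txt_gen, dim=-1)

        # Inter-modality contrast: pull aligned image-text pairs together and
        # push unaligned pairs apart, in both retrieval directions.
        inter = info_nce(img, txt) + info_nce(txt, img)

        # Intra-modality contrast via the auxiliary generative features: each
        # real feature should match its own generated counterpart.
        intra = info_nce(img, img_gen) + info_nce(txt, txt_gen)

        # Support-set regularization (one simple reading of the abstract):
        # pull each image toward the mean text embedding of its semantic
        # category within the batch, and symmetrically for texts.
        same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
        txt_support = F.normalize(same @ txt / same.sum(1, keepdim=True), dim=-1)
        img_support = F.normalize(same @ img / same.sum(1, keepdim=True), dim=-1)
        reg = ((1.0 - (img * txt_support).sum(-1)).mean()
               + (1.0 - (txt * img_support).sum(-1)).mean())

        return inter + alpha * intra + beta * reg

A batch of precomputed features and integer category labels is enough to exercise the sketch, e.g. composite_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256), torch.randint(0, 4, (8,))).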
Pages: 1938 - 1943 (6 pages)
References (21 in total)
  • [1] NUS-WIDE: A Real-World Web Image Database from National University of Singapore
    Chua, Tat-Seng
    Tang, Jinhui
    Hong, Richang
    Li, Haojie
    Luo, Zhiping
    Zheng, Yan-Tao
    [J]. PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON IMAGE AND VIDEO RETRIEVAL (CIVR '09), 2009. DOI: 10.1145/1646396.1646452
  • [2] Cross-modal Retrieval with Correspondence Autoencoder
    Feng, Fangxiang
    Wang, Xiaojie
    Li, Ruifan
    [J]. PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 7 - 16
  • [3] Generative Adversarial Nets
    Goodfellow, Ian J.
    Pouget-Abadie, Jean
    Mirza, Mehdi
    Xu, Bing
    Warde-Farley, David
    Ozair, Sherjil
    Courville, Aaron
    Bengio, Yoshua
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, : 2672 - 2680
  • [4] Cross-Modal Retrieval via Deep and Bidirectional Representation Learning
    He, Yonghao
    Xiang, Shiming
    Kang, Cuicui
    Wang, Jian
    Pan, Chunhong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (07) : 1363 - 1377
  • [5] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijia
    Zhao, Sicheng
    Han, Jungong
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [6] The MIR Flickr Retrieval Evaluation
    Huiskes, Mark J.
    Lew, Michael S.
    [J]. PROCEEDINGS OF THE 1ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA INFORMATION RETRIEVAL (MIR '08), 2008, : 39 - 43
  • [7] Deep Cross-Modal Hashing
    Jiang, Qing-Yuan
    Li, Wu-Jun
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3270 - 3278
  • [8] Kingma, D. P., 2014, Proceedings of the International Conference on Learning Representations (ICLR)
  • [9] Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
    Li, Chao
    Deng, Cheng
    Li, Ning
    Liu, Wei
    Gao, Xinbo
    Tao, Dacheng
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4242 - 4251
  • [10] Microsoft COCO: Common Objects in Context
    Lin, Tsung-Yi
    Maire, Michael
    Belongie, Serge
    Hays, James
    Perona, Pietro
    Ramanan, Deva
    Dollar, Piotr
    Zitnick, C. Lawrence
    [J]. COMPUTER VISION - ECCV 2014, PT V, 2014, 8693 : 740 - 755