Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Cited by: 4
Authors
Lu, Haoyu [1 ]
Huo, Yuqi [1 ]
Ding, Mingyu [2 ]
Fei, Nanyi [1 ]
Lu, Zhiwu [1 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing 100872, Peoples R China
[2] Univ Hong Kong, Hong Kong 999077, Peoples R China
Keywords
Image-text retrieval; multimodal modeling; contrastive learning; weak correlation; computer vision
DOI
10.1007/s11633-022-1386-4
Chinese Library Classification (CLC): TP [automation technology; computer technology]
Discipline classification code: 0812
Abstract
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, which makes them difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt a single-tower architecture with heavy detectors, which is inefficient at inference time because the costly computation must be repeated for every text-image pair. To overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture that maps the text and image modalities into a unified feature space where they can be compared directly, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms the state of the art while being much more efficient.
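To make the two ideas in the abstract concrete, the sketch below illustrates (a) a detector-free, grid-based splitting of an image into multi-scale regions in the spirit of the described MGS module, and (b) a symmetric InfoNCE-style contrastive loss on global image/text embeddings produced by two separate towers. This is a minimal illustration of the general techniques named in the abstract, not the authors' implementation; the function names, grid sizes, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F


def multi_grid_split(images, grid_sizes=(1, 2, 4)):
    """Split each image into non-overlapping multi-scale grid regions (illustrative only).

    images: tensor of shape (B, C, H, W); H and W are assumed divisible by each grid size.
    Returns one tensor per grid size, of shape (B, g*g, C, H//g, W//g).
    """
    b, c, h, w = images.shape
    regions = []
    for g in grid_sizes:
        # Split H into g blocks of size H//g and W into g blocks of size W//g.
        cells = images.reshape(b, c, g, h // g, g, w // g)
        # Collect the g*g cells as a region dimension: (B, g*g, C, H//g, W//g).
        cells = cells.permute(0, 2, 4, 1, 3, 5).reshape(b, g * g, c, h // g, w // g)
        regions.append(cells)
    return regions


def cross_modal_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss on global image/text embeddings.

    image_feats, text_feats: (B, D) outputs of the image and text towers;
    matched image-text pairs share the same batch index, all other pairs
    in the batch serve as negatives.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Because each tower embeds its modality independently, image and text features can be pre-computed once and compared with a single matrix multiplication at retrieval time, which is the efficiency advantage of the two-tower design over single-tower architectures that re-run a joint encoder for every candidate pair.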
Pages: 569-582
Number of pages: 14
Related papers
50 items in total
  • [1] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Haoyu Lu
    Yuqi Huo
    Mingyu Ding
    Nanyi Fei
    Zhiwu Lu
    Machine Intelligence Research, 2023, 20: 569-582
  • [2] Image-Text Cross-Modal Retrieval with Instance Contrastive Embedding
    Zeng, Ruigeng
    Ma, Wentao
    Wu, Xiaoqian
    Liu, Wei
    Liu, Jie
    ELECTRONICS, 2024, 13 (02)
  • [3] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019: 2309-2312
  • [4] Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval
    Yang C.
    Liu L.
    Journal of Computer-Aided Design and Computer Graphics, 2022, 34(05): 751-759
  • [5] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng
    Liu, Changhong
    Zhou, Jun
    Chen, Yong
    Jiang, Aiwen
    Li, Hanxi
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022: 239-248
  • [6] Image-text bidirectional learning network based cross-modal retrieval
    Li, Zhuoyi
    Lu, Huibin
    Fu, Hao
    Gu, Guanghua
    NEUROCOMPUTING, 2022, 483: 148-159
  • [7] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijin
    Zhao, Sicheng
    Han, Jungong
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019: 1749-1757
  • [8] Rethinking Benchmarks for Cross-modal Image-text Retrieval
    Chen, Weijing
    Yao, Linli
    Jin, Qin
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023: 1241-1251
  • [9] An Efficient Cross-Modal Privacy-Preserving Image-Text Retrieval Scheme
    Zhang, Kejun
    Xu, Shaofei
    Song, Yutuo
    Xu, Yuwei
    Li, Pengcheng
    Yang, Xiang
    Zou, Bing
    Wang, Wenbin
    SYMMETRY-BASEL, 2024, 16(08)
  • [10] Masking-Based Cross-Modal Remote Sensing Image-Text Retrieval via Dynamic Contrastive Learning
    Zhao, Zuopeng
    Miao, Xiaoran
    He, Chen
    Hu, Jianfeng
    Min, Bingbing
    Gao, Yumeng
    Liu, Ying
    Pharksuwan, Kanyaphakphachsorn
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62: 1-15