Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Citations: 4
Authors
Lu, Haoyu [1 ]
Huo, Yuqi [1 ]
Ding, Mingyu [2 ]
Fei, Nanyi [1 ]
Lu, Zhiwu [1 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing 100872, Peoples R China
[2] Univ Hong Kong, Hong Kong 999077, Peoples R China
Keywords
Image-text retrieval; multimodal modeling; contrastive learning; weak correlation; computer vision
DOI
10.1007/s11633-022-1386-4
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation within each text-image pair and are therefore difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt a single-tower architecture with heavy detectors, which is inefficient at inference time because the costly computation must be repeated for every text-image pair. To overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture that maps the text and image modalities into a unified feature space where they can be compared directly, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that CMCL readily generalizes to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that CMCL outperforms state-of-the-art methods while being much more efficient.
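The abstract describes a symmetric contrastive objective applied to the global image and text features produced by two separate encoder towers. The snippet below is a minimal PyTorch sketch of such an InfoNCE-style cross-modal loss with in-batch negatives; the function name, the temperature value, and the choice of in-batch negatives are illustrative assumptions, not the paper's exact CMCL/MGS implementation.

```python
# Minimal sketch of a two-tower cross-modal contrastive objective (InfoNCE-style).
# The encoders, feature dimensions, and temperature here are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_feats: torch.Tensor,
                                 text_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired global features.

    image_feats, text_feats: (batch, dim) global embeddings from the two towers.
    Matched pairs share the same batch index; all other pairs act as negatives.
    """
    # Project both modalities onto the unit sphere so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image retrieval losses, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Because each tower embeds its modality independently, the embeddings can be precomputed offline and retrieval reduces to a similarity search over cached vectors, which is the efficiency advantage of the two-tower design highlighted in the abstract.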
Pages: 569-582
Number of pages: 14
Related Papers
50 records in total
  • [21] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo
    Jang, Soojin
    Cho, Yunsung
    Kim, Youngbin
    Multimedia Tools and Applications, 2024, 83 (04): 12363-12377
  • [22] Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
    Mithun, Niluthpol Chowdhury
    Panda, Rameswar
    Papalexakis, Evangelos E.
    Roy-Chowdhury, Amit K.
    Proceedings of the 2018 ACM Multimedia Conference (MM'18), 2018: 1856-1864
  • [24] An Enhanced Feature Extraction Framework for Cross-Modal Image-Text Retrieval
    Zhang, Jinzhi
    Wang, Luyao
    Zheng, Fuzhong
    Wang, Xu
    Zhang, Haisu
    Remote Sensing, 2024, 16 (12)
  • [25] RICH: A rapid method for image-text cross-modal hash retrieval
    Li, Bo
    Yao, Dan
    Li, Zhixin
    Displays, 2023, 79
  • [26] Deep Cross-Modal Projection Learning for Image-Text Matching
    Zhang, Ying
    Lu, Huchuan
    Computer Vision - ECCV 2018, Part I, 2018, 11205: 707-723
  • [27] Improving text-image cross-modal retrieval with contrastive loss
    Zhang, Chumeng
    Yang, Yue
    Guo, Junbo
    Jin, Guoqing
    Song, Dan
    Liu, An An
    Multimedia Systems, 2023, 29 (02): 569-575
  • [28] A texture and saliency enhanced image learning method for cross-modal remote sensing image-text retrieval
    Yang, Rui
    Zhang, Di
    Guo, YanHe
    Wang, Shuang
    IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, 2023: 4895-4898
  • [30] A Cross-modal image retrieval method based on contrastive learning
    Zhou, Wen
    Journal of Optics (India), 2024, 53 (03): 2098-2107