Self-Supervised Correlation Learning for Cross-Modal Retrieval

Cited by: 29
Authors
Liu, Yaxin [1]
Wu, Jianlong [1]
Qu, Leigang [1]
Gan, Tian [1]
Yin, Jianhua [1]
Nie, Liqiang [1]
Affiliations
[1] Shandong University, School of Computer Science and Technology, Qingdao 266237, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; self-supervised contrastive learning; mutual information estimation
DOI
10.1109/TMM.2022.3152086
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812 (Computer Science and Technology)
Abstract
Cross-modal retrieval aims to retrieve relevant data from another modality when given a query of one modality. Although most existing methods that rely on the label information of multimedia data have achieved promising results, the performance gained from labeled data comes at a high cost, since labeling often requires enormous labor, especially on large-scale multimedia datasets. Therefore, unsupervised cross-modal learning is of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we derive a supervisory signal from the input itself by maximizing the mutual information between the input and the output of each modality-specific projector. In addition, to learn discriminative representations, we exploit unsupervised contrastive learning to model the relationships among intra- and inter-modality instances, pulling similar samples closer and pushing dissimilar samples apart. Moreover, to further reduce the modality gap, we adopt a weight-sharing scheme and minimize a modality-invariant loss in the joint representation space. Beyond that, we also extend the proposed method to the semi-supervised setting. Extensive experiments on three widely used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.
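
This record stops at the abstract, so the paper's exact objectives are not shown here. As a rough illustration of the first ingredient the abstract names, the sketch below estimates and maximizes mutual information between a projector's input and output via a MINE-style lower bound (Belghazi et al., 2018); the statistics network `StatisticsNet`, the helper `mine_lower_bound`, and all dimensions are illustrative assumptions, not the paper's code.

```python
# MINE-style lower bound on I(x; z), where x is a modality-specific
# feature and z is the corresponding projector output. Hypothetical
# sketch: the paper's actual estimator and architecture may differ.
import math
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """T(x, z): scores pairs; high on joint samples, low on marginal ones."""
    def __init__(self, dim_x: int, dim_z: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_lower_bound(T, x, z):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    Marginal samples are made by shuffling z within the batch."""
    joint = T(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
    return joint - marginal  # maximize this to maximize I(x; z)
```

In training, one would ascend this bound jointly over the projector and T, so that the projector's output retains information about its input even without category labels.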
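
The abstract's other two ingredients, inter-modality contrastive learning and a modality-invariant loss under weight sharing, could look roughly like the following. The InfoNCE form, the squared-distance invariance term, and all names (`SharedProjector`, `info_nce`, `modality_invariant`) are assumptions for illustration, not SCL's published losses.

```python
# Minimal PyTorch sketch of a cross-modal InfoNCE-style contrastive loss
# and a modality-invariant loss computed after a weight-shared projector.
# All names and exact loss forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjector(nn.Module):
    """Weight-shared head mapping both modalities into a joint space."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_out), nn.ReLU(), nn.Linear(dim_out, dim_out)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(img, txt, tau: float = 0.07):
    """Contrastive loss: matched image-text pairs are positives,
    all other pairs in the batch serve as negatives."""
    logits = img @ txt.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric over image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def modality_invariant(img, txt):
    """One plausible modality-invariant loss: pull paired embeddings
    together in the joint space (the paper may use a different form)."""
    return (img - txt).pow(2).sum(dim=-1).mean()

# Usage, with random features standing in for modality-specific encoders.
proj = SharedProjector(dim_in=512, dim_out=128)
img_feat, txt_feat = torch.randn(32, 512), torch.randn(32, 512)
z_img, z_txt = proj(img_feat), proj(txt_feat)
loss = info_nce(z_img, z_txt) + modality_invariant(z_img, z_txt)
loss.backward()
```

Routing both modalities through one projector is one way to realize the abstract's weight-sharing scheme: it forces image and text features through the same mapping into the joint representation space.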
Pages: 2851-2863
Page count: 13