Self-Supervised Correlation Learning for Cross-Modal Retrieval

Cited by: 29
Authors
Liu, Yaxin [1]
Wu, Jianlong [1]
Qu, Leigang [1]
Gan, Tian [1]
Yin, Jianhua [1]
Nie, Liqiang [1]
Affiliations
[1] Shandong University, School of Computer Science and Technology, Qingdao 266237, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal retrieval; self-supervised contrastive learning; mutual information estimation
DOI
10.1109/TMM.2022.3152086
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812 (Computer Science and Technology)
Abstract
Cross-modal retrieval aims to retrieve relevant data from another modality when given a query of one modality. Although most existing methods that rely on the label information of multimedia data have achieved promising results, the performance gained from labeled data comes at a high cost, since labeling often requires enormous labor, especially on large-scale multimedia datasets. Therefore, unsupervised cross-modal learning is of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we derive a supervisory signal from the input itself by maximizing the mutual information between the input and the output of each modality-specific projector. In addition, to learn discriminative representations, we exploit unsupervised contrastive learning to model the relationships among intra- and inter-modality instances, pulling similar samples closer and pushing dissimilar samples apart. Moreover, to further reduce the modality gap, we adopt a weight-sharing scheme and minimize a modality-invariant loss in the joint representation space. Beyond that, we also extend the proposed method to the semi-supervised setting. Extensive experiments on three widely used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.
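
This record stops at the abstract, so the paper's exact objectives are not shown here. As a rough illustration of the first ingredient the abstract names, the sketch below estimates and maximizes mutual information between a projector's input and output via a MINE-style lower bound (Belghazi et al., 2018); the statistics network `StatisticsNet`, the helper `mine_lower_bound`, and all dimensions are illustrative assumptions, not the paper's code.

```python
# MINE-style lower bound on I(x; z), where x is a modality-specific
# feature and z is the corresponding projector output. Hypothetical
# sketch: the paper's actual estimator and architecture may differ.
import math
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """T(x, z): scores pairs; high on joint samples, low on marginal ones."""
    def __init__(self, dim_x: int, dim_z: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_lower_bound(T, x, z):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    Marginal samples are made by shuffling z within the batch."""
    joint = T(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
    return joint - marginal  # maximize this to maximize I(x; z)
```

In training, one would ascend this bound jointly over the projector and T, so that the projector's output retains information about its input even without category labels.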
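
The abstract's other two ingredients, inter-modality contrastive learning and a modality-invariant loss under weight sharing, could look roughly like the following. The InfoNCE form, the squared-distance invariance term, and all names (`SharedProjector`, `info_nce`, `modality_invariant`) are assumptions for illustration, not SCL's published losses.

```python
# Minimal PyTorch sketch of a cross-modal InfoNCE-style contrastive loss
# and a modality-invariant loss computed after a weight-shared projector.
# All names and exact loss forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjector(nn.Module):
    """Weight-shared head mapping both modalities into a joint space."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_out), nn.ReLU(), nn.Linear(dim_out, dim_out)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(img, txt, tau: float = 0.07):
    """Contrastive loss: matched image-text pairs are positives,
    all other pairs in the batch serve as negatives."""
    logits = img @ txt.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric over image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def modality_invariant(img, txt):
    """One plausible modality-invariant loss: pull paired embeddings
    together in the joint space (the paper may use a different form)."""
    return (img - txt).pow(2).sum(dim=-1).mean()

# Usage, with random features standing in for modality-specific encoders.
proj = SharedProjector(dim_in=512, dim_out=128)
img_feat, txt_feat = torch.randn(32, 512), torch.randn(32, 512)
z_img, z_txt = proj(img_feat), proj(txt_feat)
loss = info_nce(z_img, z_txt) + modality_invariant(z_img, z_txt)
loss.backward()
```

Routing both modalities through one projector is one way to realize the abstract's weight-sharing scheme: it forces image and text features through the same mapping into the joint representation space.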
Pages: 2851-2863
Page count: 13