Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning

Cited: 13
Authors
Xu, Haohang [1 ]
Zhang, Xiaopeng [2 ]
Li, Hao [1 ]
Xie, Lingxi [2 ]
Dai, Wenrui [1 ]
Xiong, Hongkai [1 ]
Tian, Qi [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China
[2] Huawei Inc, Shenzhen 518129, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Training; Generators; Feature extraction; Semantics; Representation learning; Image reconstruction; Self-supervised learning; unsupervised learning; contrastive learning;
DOI
10.1109/TPAMI.2022.3176690
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Self-supervised learning based on instance discrimination has shown remarkable progress. In particular, contrastive learning, which regards each image and its augmentations as an individual class and tries to distinguish them from all other images, has proven effective for representation learning. However, conventional contrastive learning does not explicitly model the relation between semantically similar samples. In this paper, we propose a general module that considers the semantic similarity among images. This is achieved by expanding the views generated from a single image to cross-samples and multi-levels, and modeling the invariance to semantically similar images in a hierarchical way. Specifically, the cross-samples are generated by a data mixing operation that is constrained to samples that are semantically similar, while the multi-level samples are expanded at the intermediate layers of a network. In this way, the contrastive loss is extended to allow multiple positives per anchor, and explicitly pulls semantically similar images together at different layers of the network. Our method, termed CSML, integrates multi-level representations across samples in a robust way. CSML is applicable to current contrastive-learning methods and consistently improves their performance. Notably, using MoCo v2 as an instantiation, CSML achieves 76.6% top-1 accuracy under linear evaluation with a ResNet-50 backbone, and 66.7% and 75.1% top-1 accuracy with only 1% and 10% of the labels, respectively. All of these numbers set a new state of the art. The code is available at https://github.com/haohang96/CSML.
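The extension of the contrastive loss to multiple positives per anchor can be sketched as follows. This is a minimal illustrative implementation of a multi-positive InfoNCE-style objective, not the paper's released code: the function name, the NumPy formulation, and the temperature value are assumptions, and the positive set here stands in for the augmentations, mixed cross-samples, and semantically similar images that CSML uses.

```python
import numpy as np

def multi_positive_nce(anchor, candidates, positive_mask, temperature=0.2):
    """Contrastive loss with multiple positives per anchor (illustrative sketch).

    anchor:        (d,) L2-normalized embedding of the anchor view.
    candidates:    (n, d) L2-normalized embeddings of all candidate views.
    positive_mask: (n,) boolean array, True where the candidate is a positive
                   (e.g. an augmentation, a mixed cross-sample, or another
                   semantically similar image).
    """
    # Scaled cosine similarities between the anchor and every candidate.
    logits = candidates @ anchor / temperature
    # Log-softmax over all candidates (positives and negatives together).
    log_prob = logits - np.log(np.exp(logits).sum())
    # Average the negative log-likelihood over every positive, so all
    # semantically similar views are pulled toward the anchor at once.
    return -log_prob[positive_mask].mean()
```

With a single positive this reduces to the standard InfoNCE loss; enlarging the positive set is what lets the objective pull whole groups of semantically similar views together rather than only an image and its own augmentations.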
Pages: 3753-3767
Number of Pages: 15