Learning What and Where to Learn: A New Perspective on Self-Supervised Learning

Cited by: 5
Authors
Zhao, Wenyi [1 ]
Yang, Lu [1 ]
Zhang, Weidong [2 ]
Tian, Yongqin [2 ]
Jia, Wenhe [1 ]
Li, Wei [1 ]
Yang, Mu [3 ]
Pan, Xipeng [4 ]
Yang, Huihua [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100876, Peoples R China
[2] Henan Inst Sci & Technol, Sch Informat Engn, Xinxiang 453003, Peoples R China
[3] Techmach Beijing Ind Technol Co Ltd, Beijing 102676, Peoples R China
[4] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Feature extraction; Task analysis; Computational modeling; Optimization; Self-supervised learning; Training; learning what; learning where; efficient framework; positional information;
DOI
10.1109/TCSVT.2023.3298937
CLC Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Self-supervised learning (SSL) has demonstrated its power in acquiring generalized models by leveraging the discriminative semantic and explicit positional information of unlabeled datasets. Unfortunately, mainstream contrastive-learning-based methods focus excessively on semantic information and ignore that position is also a carrier of image content, resulting in inadequate data utilization and extensive computational consumption. To address these issues, we present an efficient SSL framework, learning What and Where to learn ($\text{W}^{2}\text{SSL}$), to aggregate semantic and positional features. Concretely, we devise a spatially coupled sampling scheme that processes images through pre-defined rules, integrating the advantages of semantic (What) and positional (Where) features into the framework to enrich the diversity of feature representations and improve data utilization. Besides, a spectrum of latent vectors is obtained by mapping the positional features, which implicitly explores the relationships among these vectors. Thereafter, the corresponding discriminative and contrastive optimization objectives are seamlessly embedded in the framework via a cascade paradigm to exploit semantic and positional features. The proposed $\text{W}^{2}\text{SSL}$ is verified on different types of datasets and outperforms state-of-the-art SSL methods even with half the computational consumption. Code will be available at https://github.com/WilyZhao8/W2SSL.
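To make the cascade of objectives concrete, below is a minimal PyTorch sketch of how a grid-based positional ("Where") objective can be combined with a contrastive semantic ("What") objective. This is not the authors' released code: the grid sampling, network shapes, and all names (grid_patches, W2Heads, info_nce) are hypothetical illustrations of the idea stated in the abstract, not the paper's actual spatially coupled sampling rules.

import torch
import torch.nn as nn
import torch.nn.functional as F

def grid_patches(x, g=2):
    """Split images (B, C, H, W) into a g x g grid of patches.
    Returns patches (B*g*g, C, H/g, W/g) and each patch's grid index."""
    B, C, H, W = x.shape
    ph, pw = H // g, W // g
    p = x.unfold(2, ph, ph).unfold(3, pw, pw)        # (B, C, g, g, ph, pw)
    patches = p.permute(0, 2, 3, 1, 4, 5).reshape(B * g * g, C, ph, pw)
    pos = torch.arange(g * g).repeat(B)              # "where" label per patch
    return patches, pos

class W2Heads(nn.Module):
    """Shared encoder with a semantic (contrastive) head and a positional head."""
    def __init__(self, dim=128, g=2):
        super().__init__()
        self.encoder = nn.Sequential(                # stand-in for a real backbone
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.what_head = nn.Linear(32, dim)          # semantic projection
        self.where_head = nn.Linear(32, g * g)       # grid-position classifier

    def forward(self, patches):
        h = self.encoder(patches)
        return F.normalize(self.what_head(h), dim=1), self.where_head(h)

def info_nce(z1, z2, tau=0.2):
    """Standard InfoNCE between two views; positives share a row index."""
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

# Toy usage: two views of the same batch, joint "what" + "where" loss.
model = W2Heads()
x = torch.randn(4, 3, 64, 64)
v1, pos = grid_patches(x)
v2, _ = grid_patches(x + 0.1 * torch.randn_like(x))  # crude second "view"
z1, p1 = model(v1)
z2, _ = model(v2)
loss = info_nce(z1, z2) + F.cross_entropy(p1, pos)   # contrastive + discriminative
loss.backward()

The contrastive term pulls together patch features from the two views (semantics), while the cross-entropy term forces the encoder to predict each patch's grid position, so positional information is supervised rather than discarded; how the paper actually couples the two sampling streams is described in the full text.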
Pages: 6620-6633
Page count: 14