SERE: Exploring Feature Self-Relation for Self-Supervised Transformer

Cited by: 3

Authors:

Li, Zhong-Yu ^{[1]}

Gao, Shanghua ^{[1]}

Cheng, Ming-Ming ^{[1]}

Affiliations:

[1] Nankai Univ, TMCC, CS, Tianjin 300350, Peoples R China

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2023 / Vol. 45 / No. 12

Keywords:

Training; Self-supervised learning; Task analysis; Feature extraction; Convolutional neural networks; Transformers; Semantics; vision transformer; feature self-relation;

DOI:

10.1109/TPAMI.2023.3309979

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning representations with self-supervision for convolutional networks (CNN) has been validated to be effective for vision tasks. As an alternative to CNN, vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViT. Still, most works follow self-supervised strategies designed for CNN, e.g., instance-level discrimination of samples, but they ignore the properties of ViT. We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks. To enforce this property, we explore the feature SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations, i.e., spatial/channel self-relations, for self-supervised learning. Self-relation based learning further enhances the relation modeling ability of ViT, resulting in stronger representations that stably improve performance on multiple downstream tasks.
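The abstract names spatial and channel self-relations without giving formulas, so the following is only a minimal sketch of what such relations might look like, not the authors' implementation. It assumes that a self-relation is a row-wise softmax over cosine similarities, that the two views' tokens are already spatially aligned, and that the views are compared with a cross-entropy loss. The function names, the temperature value, and the toy feature values are all hypothetical.

```python
import math

def softmax(row, temperature=1.0):
    # Numerically stable softmax over one row of similarity scores.
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def self_relation(rows, temperature=0.5):
    # Assumed form: softmax over cosine similarities between all row pairs.
    unit = [l2_normalize(r) for r in rows]
    return [
        softmax([sum(a * b for a, b in zip(u, v)) for v in unit], temperature)
        for u in unit
    ]

def spatial_self_relation(features, temperature=0.5):
    # features: N tokens x C channels -> N x N relation between tokens.
    return self_relation(features, temperature)

def channel_self_relation(features, temperature=0.5):
    # Transpose to C x N first -> C x C relation between channels.
    channels = [list(col) for col in zip(*features)]
    return self_relation(channels, temperature)

def relation_cross_entropy(target, prediction, eps=1e-12):
    # Mean row-wise cross-entropy between two relation matrices.
    total = 0.0
    for t_row, p_row in zip(target, prediction):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return total / len(target)

# Toy features for two augmented views: 4 tokens x 3 channels each.
view_a = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.5, 0.5, 0.5]]
view_b = [[0.0, 1.0, 1.0], [2.0, 0.0, 0.5], [1.0, 1.0, 0.0], [0.0, 0.5, 2.0]]

spatial_loss = relation_cross_entropy(
    spatial_self_relation(view_a), spatial_self_relation(view_b)
)
channel_loss = relation_cross_entropy(
    channel_self_relation(view_a), channel_self_relation(view_b)
)
print("spatial relation loss:", spatial_loss)
print("channel relation loss:", channel_loss)
```

In this sketch the loss is computed on the relation matrices rather than on the feature embeddings themselves, which is the distinction the abstract draws; in an actual training setup one view would typically come from a separate target network and be treated as fixed.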

Citation

Pages: 15619-15631

Page count: 13

104 references in total
