MST: Masked Self-Supervised Transformer for Visual Representation

Cited by: 0
Authors
Li, Zhaowen [1 ,2 ,3 ]
Chen, Zhiyang [1 ,2 ]
Yang, Fan [3 ]
Li, Wei [3 ]
Zhu, Yousong [1 ]
Zhao, Chaoyang [1 ]
Deng, Rui [3 ,4 ]
Wu, Liwei [3 ]
Zhao, Rui [3 ]
Tang, Ming [1 ]
Wang, Jinqiao [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] SenseTime Res, Hong Kong, Peoples R China
[4] Univ Calif Los Angeles, Los Angeles, CA USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure needed for self-supervised learning. More importantly, the masked tokens, together with the remaining tokens, are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, with only 300-epoch pre-training, MST achieves a Top-1 accuracy of 76.9% with DeiT-S under linear evaluation, outperforming the supervised baseline trained for the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
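The attention-guided masking described in the abstract can be sketched roughly as follows (a minimal PyTorch-style illustration, assuming the mask is derived from the class-token row of the last block's multi-head attention and a fixed mask ratio; the function and argument names are hypothetical and not taken from the authors' released code): patches that receive the lowest class-token attention are treated as non-essential and replaced with a learnable mask token, so the crucial structure is left intact for the global image decoder to recover.

    import torch

    def attention_guided_mask(tokens, attn, mask_token, mask_ratio=0.15):
        """Mask the patch tokens that receive the LOWEST class-token attention.

        tokens:     (B, N, D) patch embeddings, class token excluded
        attn:       (B, H, N+1, N+1) multi-head self-attention from the last block
        mask_token: (D,) learnable embedding substituted for masked patches
        """
        # Class-token attention to every patch, averaged over the H heads.
        cls_attn = attn[:, :, 0, 1:].mean(dim=1)                 # (B, N)
        num_mask = int(mask_ratio * cls_attn.size(1))
        # Least-attended patches are assumed to be non-essential local regions.
        low_idx = cls_attn.argsort(dim=1)[:, :num_mask]          # (B, num_mask)
        mask = torch.zeros_like(cls_attn, dtype=torch.bool)
        batch_idx = torch.arange(cls_attn.size(0)).unsqueeze(1)
        mask[batch_idx, low_idx] = True
        # Replace masked patches with the learnable mask token.
        masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
        return masked, mask

Masking low-attention rather than random patches is the design choice highlighted in the abstract: the regions that drive the self-supervised objective are never destroyed, and the masked sequence is then passed to the global image decoder for recovery.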
Pages: 12
Related Papers
50 records in total
  • [21] Self-Supervised Pretraining Vision Transformer With Masked Autoencoders for Building Subsurface Model
    Li, Yuanyuan
    Alkhalifah, Tariq
    Huang, Jianping
    Li, Zhenchun
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [22] Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
    Wang, Rui
    Chen, Dongdong
    Wu, Zuxuan
    Chen, Yinpeng
    Dai, Xiyang
    Liu, Mengchen
    Yuan, Lu
    Jiang, Yu-Gang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 6312-6322
  • [23] Masked Feature Prediction for Self-Supervised Visual Pre-Training
    Wei, Chen
    Fan, Haoqi
    Xie, Saining
    Wu, Chao-Yuan
    Yuille, Alan
    Feichtenhofer, Christoph
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 14648-14658
  • [24] Self-supervised Video Transformer
    Ranasinghe, Kanchana
    Naseer, Muzammal
    Khan, Salman
    Khan, Fahad Shahbaz
    Ryoo, Michael S.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 2864-2874
  • [25] Cross-View Masked Model for Self-Supervised Graph Representation Learning
    Duan H.
    Yu B.
    Xie C.
    IEEE Transactions on Artificial Intelligence, 2024, 5(11): 1-13
  • [26] Masked self-supervised ECG representation learning via multiview information bottleneck
    Yang, Shunxiang
    Lian, Cheng
    Zeng, Zhigang
    Xu, Bingrong
    Su, Yixin
    Xue, Chenyang
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): : 7625 - 7637
  • [27] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
    Hsu, Wei-Ning
    Bolte, Benjamin
    Tsai, Yao-Hung Hubert
    Lakhotia, Kushal
    Salakhutdinov, Ruslan
    Mohamed, Abdelrahman
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 3451-3460
  • [29] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
    Luo, Jian
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2021, 2021: 1169-1173
  • [30] AST: Adaptive Self-supervised Transformer for optical remote sensing representation
    He, Qibin
    Sun, Xian
    Yan, Zhiyuan
    Wang, Bing
    Zhu, Zicong
    Diao, Wenhui
    Yang, Michael Ying
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2023, 200: 41-54