MST: Masked Self-Supervised Transformer for Visual Representation

Cited by: 0
Authors
Li, Zhaowen [1 ,2 ,3 ]
Chen, Zhiyang [1 ,2 ]
Yang, Fan [3 ]
Li, Wei [3 ]
Zhu, Yousong [1 ]
Zhao, Chaoyang [1 ]
Deng, Rui [3 ,4 ]
Wu, Liwei [3 ]
Zhao, Rui [3 ]
Tang, Ming [1 ]
Wang, Jinqiao [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] SenseTime Res, Hong Kong, Peoples R China
[4] Univ Calif Los Angeles, Los Angeles, CA USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure needed for self-supervised learning. More importantly, the masked tokens, together with the remaining tokens, are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, with only 300-epoch pre-training, MST achieves a Top-1 accuracy of 76.9% with DeiT-S under linear evaluation, outperforming the supervised baseline trained for the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
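The attention-guided masking described in the abstract can be sketched roughly as follows (a minimal PyTorch-style illustration, assuming the mask is derived from the class-token row of the last block's multi-head attention and a fixed mask ratio; the function and argument names are hypothetical and not taken from the authors' released code): patches that receive the lowest class-token attention are treated as non-essential and replaced with a learnable mask token, so the crucial structure is left intact for the global image decoder to recover.

    import torch

    def attention_guided_mask(tokens, attn, mask_token, mask_ratio=0.15):
        """Mask the patch tokens that receive the LOWEST class-token attention.

        tokens:     (B, N, D) patch embeddings, class token excluded
        attn:       (B, H, N+1, N+1) multi-head self-attention from the last block
        mask_token: (D,) learnable embedding substituted for masked patches
        """
        # Class-token attention to every patch, averaged over the H heads.
        cls_attn = attn[:, :, 0, 1:].mean(dim=1)                 # (B, N)
        num_mask = int(mask_ratio * cls_attn.size(1))
        # Least-attended patches are assumed to be non-essential local regions.
        low_idx = cls_attn.argsort(dim=1)[:, :num_mask]          # (B, num_mask)
        mask = torch.zeros_like(cls_attn, dtype=torch.bool)
        batch_idx = torch.arange(cls_attn.size(0)).unsqueeze(1)
        mask[batch_idx, low_idx] = True
        # Replace masked patches with the learnable mask token.
        masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
        return masked, mask

Masking low-attention rather than random patches is the design choice highlighted in the abstract: the regions that drive the self-supervised objective are never destroyed, and the masked sequence is then passed to the global image decoder for recovery.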
Pages: 12
Related Papers
50 records in total
  • [21] Self-Supervised Pretraining Vision Transformer With Masked Autoencoders for Building Subsurface Model
    Li, Yuanyuan
    Alkhalifah, Tariq
    Huang, Jianping
    Li, Zhenchun
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [22] Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
    Wang, Rui
    Chen, Dongdong
    Wu, Zuxuan
    Chen, Yinpeng
    Dai, Xiyang
    Liu, Mengchen
    Yuan, Lu
    Jiang, Yu-Gang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 6312-6322
  • [23] Masked Feature Prediction for Self-Supervised Visual Pre-Training
    Wei, Chen
    Fan, Haoqi
    Xie, Saining
    Wu, Chao-Yuan
    Yuille, Alan
    Feichtenhofer, Christoph
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 14648-14658
  • [24] Self-supervised Video Transformer
    Ranasinghe, Kanchana
    Naseer, Muzammal
    Khan, Salman
    Khan, Fahad Shahbaz
    Ryoo, Michael S.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 2864-2874
  • [25] Cross-View Masked Model for Self-Supervised Graph Representation Learning
    Duan H.
    Yu B.
    Xie C.
    IEEE Transactions on Artificial Intelligence, 2024, 5(11): 1-13
  • [26] Masked self-supervised ECG representation learning via multiview information bottleneck
    Yang, Shunxiang
    Lian, Cheng
    Zeng, Zhigang
    Xu, Bingrong
    Su, Yixin
    Xue, Chenyang
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): : 7625 - 7637
  • [27] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
    Hsu, Wei-Ning
    Bolte, Benjamin
    Tsai, Yao-Hung Hubert
    Lakhotia, Kushal
    Salakhutdinov, Ruslan
    Mohamed, Abdelrahman
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 3451-3460
  • [29] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
    Luo, Jian
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2021, 2021: 1169-1173
  • [30] AST: Adaptive Self-supervised Transformer for optical remote sensing representation
    He, Qibin
    Sun, Xian
    Yan, Zhiyuan
    Wang, Bing
    Zhu, Zicong
    Diao, Wenhui
    Yang, Michael Ying
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2023, 200: 41-54