MST: Masked Self-Supervised Transformer for Visual Representation

Cited by: 0
Authors:
Li, Zhaowen [1 ,2 ,3 ]
Chen, Zhiyang [1 ,2 ]
Yang, Fan [3 ]
Li, Wei [3 ]
Zhu, Yousong [1 ]
Zhao, Chaoyang [1 ]
Deng, Rui [3 ,4 ]
Wu, Liwei [3 ]
Zhao, Rui [3 ]
Tang, Ming [1 ]
Wang, Jinqiao [1 ,2 ]
Affiliations:
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] SenseTime Res, Hong Kong, Peoples R China
[4] Univ Calif Los Angeles, Los Angeles, CA USA
Funding: National Natural Science Foundation of China
Keywords:
DOI: not available
CLC classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract:
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP) and have achieved great success. However, they have not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% with DeiT-S using only 300 epochs of pre-training under linear evaluation, outperforming supervised training with the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100 epochs of pre-training.
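The attention-guided masking described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the function name mask_low_attention_tokens, the 0.3 mask ratio, and the use of class-token attention averaged over heads are illustrative assumptions; it only shows the core idea of replacing the least-attended patch tokens with a learnable mask token so that crucial (highly attended) structure is preserved.

import torch

def mask_low_attention_tokens(patch_tokens: torch.Tensor,
                              cls_attention: torch.Tensor,
                              mask_token: torch.Tensor,
                              mask_ratio: float = 0.3) -> torch.Tensor:
    """Replace the least-attended patch tokens with a learnable mask token.

    patch_tokens:  (B, N, D) patch embeddings, without the class token
    cls_attention: (B, N) attention from the class token to the patches,
                   averaged over heads (an assumed input to this sketch)
    mask_token:    (D,) learnable embedding used to replace masked patches
    """
    B, N, D = patch_tokens.shape
    num_mask = int(N * mask_ratio)
    # Indices of the num_mask patches that receive the LEAST attention,
    # so the highly attended, semantically crucial tokens are kept.
    low_idx = cls_attention.argsort(dim=1)[:, :num_mask]           # (B, num_mask)
    batch_idx = torch.arange(B).unsqueeze(1).expand(-1, num_mask)  # (B, num_mask)
    masked = patch_tokens.clone()
    masked[batch_idx, low_idx] = mask_token.to(patch_tokens.dtype)
    return masked

# Usage example with random tensors
if __name__ == "__main__":
    B, N, D = 2, 196, 384
    tokens = torch.randn(B, N, D)
    attn = torch.rand(B, N).softmax(dim=1)
    mask_tok = torch.zeros(D)
    out = mask_low_attention_tokens(tokens, attn, mask_tok)
    print(out.shape)  # torch.Size([2, 196, 384])

In the paper's full pipeline, the masked and remaining tokens would then be fed to a global image decoder that reconstructs the image, encouraging the representation to keep spatial detail useful for dense prediction.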
Pages: 12