MST: Masked Self-Supervised Transformer for Visual Representation

Cited: 0
Authors
Li, Zhaowen [1 ,2 ,3 ]
Chen, Zhiyang [1 ,2 ]
Yang, Fan [3 ]
Li, Wei [3 ]
Zhu, Yousong [1 ]
Zhao, Chaoyang [1 ]
Deng, Rui [3 ,4 ]
Wu, Liwei [3 ]
Zhao, Rui [3 ]
Tang, Ming [1 ]
Wang, Jinqiao [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Natl Lab Pattern Recognit, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] SenseTime Res, Hong Kong, Peoples R China
[4] Univ Calif Los Angeles, Los Angeles, CA USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
(not available)
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP) and have achieved great success. However, they have not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the structure crucial for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% with DeiT-S under linear evaluation using only 300 epochs of pre-training, outperforming supervised training for the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100 epochs of pre-training.
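The masked token strategy described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it assumes the class-token attention weights over patch tokens are available, averages them over heads, and masks only within the low-attention half of the patches so that tokens crucial to the global semantics stay visible. The function name, the 50% candidate cutoff, and the mask ratio are assumptions for illustration.

```python
import numpy as np

def attention_guided_mask(cls_attn, mask_ratio=0.3, rng=None):
    """Sketch of attention-guided token masking in the spirit of MST.

    cls_attn: array of shape (num_heads, num_patches) holding the
        attention weights from the class token to each patch token.
    mask_ratio: fraction of patch tokens to mask.

    Returns a boolean mask of shape (num_patches,), True = masked.
    """
    rng = np.random.default_rng(rng)
    avg_attn = cls_attn.mean(axis=0)          # average over heads
    num_patches = avg_attn.shape[0]
    num_mask = int(num_patches * mask_ratio)
    # Candidates are the half of the patches with the lowest attention,
    # so highly attended (semantically crucial) patches are never masked.
    order = np.argsort(avg_attn)              # ascending: low attention first
    candidates = order[: num_patches // 2]
    masked = rng.choice(candidates, size=num_mask, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked] = True
    return mask
```

Restricting the masked set to low-attention tokens is what distinguishes this strategy from the uniform random masking of MLM: the recovery task stays non-trivial while the tokens that carry the image's global semantics remain visible to the encoder.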
Pages: 12
Related Papers
50 results
  • [1] Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning
    Wei, Yibing
    Gupta, Abhinav
    Morgado, Pedro
    COMPUTER VISION - ECCV 2024, PT XXXIX, 2025, 15097 : 1 - 17
  • [2] Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection
    Madan, Neelu
    Ristea, Nicolae-Catalin
    Ionescu, Radu Tudor
    Nasrollahi, Kamal
    Khan, Fahad Shahbaz
    Moeslund, Thomas B.
    Shah, Mubarak
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (01) : 525 - 542
  • [3] A Survey on Masked Autoencoder for Visual Self-supervised Learning
    Zhang, Chaoning
    Zhang, Chenshuang
    Song, Junha
    Yi, John Seon Keun
    Kweon, In So
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 6805 - 6813
  • [4] Self-Supervised Dense Visual Representation Learning
    Ozcelik, Timoteos Onur
    Gokberk, Berk
    Akarun, Lale
    32ND IEEE SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU 2024, 2024,
  • [5] Revisiting Self-Supervised Visual Representation Learning
    Kolesnikov, Alexander
    Zhai, Xiaohua
    Beyer, Lucas
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1920 - 1929
  • [6] Masked Motion Encoding for Self-Supervised Video Representation Learning
    Sun, Xinyu
    Chen, Peihao
    Chen, Liangwei
    Li, Changhao
    Li, Thomas H.
    Tan, Mingkui
    Gan, Chuang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2235 - 2245
  • [7] Node and edge dual-masked self-supervised graph representation
    Tang, Peng
    Xie, Cheng
    Duan, Haoran
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (04) : 2307 - 2326
  • [8] GMAEEG: A Self-Supervised Graph Masked Autoencoder for EEG Representation Learning
    Fu, Zanhao
    Zhu, Huaiyu
    Zhao, Yisheng
    Huan, Ruohong
    Zhang, Yi
    Chen, Shuohui
    Pan, Yun
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2024, 28 (11) : 6486 - 6497
  • [9] Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning
    Chen, Yuxiao
    Zhao, Long
    Yuan, Jianbo
    Tian, Yu
    Xia, Zhaoyang
    Geng, Shijie
    Han, Ligong
    Metaxas, Dimitris N.
    COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 185 - 202
  • [10] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
    Liu, Andy T.
    Li, Shang-Wen
    Lee, Hung-yi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2351 - 2366