mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

Cited by: 13
Authors
Li, Xiaotong [1]
Ge, Yixiao [2]
Yi, Kun [2]
Hu, Zixuan [1]
Shan, Ying [2]
Duan, Ling-Yu [1,3]
Affiliations
[1] Peking University, Beijing, China
[2] ARC Lab, Tencent PCG, Beijing, China
[3] Peng Cheng Laboratory, Shenzhen, China
Source
Computer Vision - ECCV 2022, Part XXX | 2022 / Vol. 13690
Funding
National Natural Science Foundation of China
Keywords
Self-supervised learning; Vision transformers; Image BERT pre-training
DOI
10.1007/978-3-031-20056-4_14
CLC classification
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Image BERT pre-training with masked image modeling (MIM) has become a popular approach to self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens with a pre-learned dVAE. Although feasible, this improper discretization hinders further improvements in image pre-training. Since image discretization has no ground-truth answer, we argue that a masked patch should not be assigned a unique token id even if a better "tokenizer" could be obtained. In this work, we introduce an improved BERT-style image pre-training method, mc-BEiT, which performs the MIM proxy task towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is given by the soft probability vectors over the discrete token ids, which are predicted by an off-the-shelf image "tokenizer" and further refined by high-level inter-patch perceptions, following the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method: e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code is available at https://github.com/lixiaotong97/mc-BEiT.
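The multi-choice objective described in the abstract can be made concrete with a short sketch. The following is a minimal PyTorch sketch, not the authors' released implementation: the function and variable names (multi_choice_targets, tokenizer_logits, patch_embed) and the temperature/mixing hyperparameters tau and omega are illustrative assumptions. It forms soft targets from tokenizer logits instead of hard token ids, refines them with an inter-patch similarity so that similar patches share their choices, and applies a soft cross-entropy on the masked patches only.

import torch
import torch.nn.functional as F

def multi_choice_targets(tokenizer_logits, patch_embed, tau=1.0, omega=0.5):
    """Sketch of mc-BEiT-style soft targets (illustrative, not the official code).

    tokenizer_logits: (B, N, V) logits over the visual vocabulary from an
                      off-the-shelf image tokenizer (e.g. a pre-learned dVAE).
    patch_embed:      (B, N, D) high-level patch features used to measure
                      inter-patch similarity.
    tau:              softmax temperature for the tokenizer probabilities.
    omega:            mixing weight between the raw soft targets and the
                      similarity-refined targets.
    """
    # Soft probabilities over token ids instead of a single hard id.
    p = F.softmax(tokenizer_logits / tau, dim=-1)              # (B, N, V)

    # Inter-patch affinity: similar patches should share their "choices".
    feat = F.normalize(patch_embed, dim=-1)                    # (B, N, D)
    affinity = F.softmax(feat @ feat.transpose(1, 2), dim=-1)  # (B, N, N)

    # Refine each patch's target by aggregating targets of similar patches.
    p_refined = affinity @ p                                   # (B, N, V)

    # Final multi-choice target: blend of direct and refined supervision.
    return omega * p + (1.0 - omega) * p_refined

def soft_ce_loss(pred_logits, targets, mask):
    """Soft cross-entropy computed on the masked patches only.

    pred_logits: (B, N, V) predictions from the pre-training backbone.
    targets:     (B, N, V) soft multi-choice targets.
    mask:        (B, N) boolean mask marking the corrupted patches.
    """
    log_q = F.log_softmax(pred_logits, dim=-1)
    loss = -(targets * log_q).sum(-1)                          # (B, N)
    mask = mask.float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

In this reading, tokenizer_logits would come from the pre-learned dVAE tokenizer and patch_embed from a high-level encoder layer; the blend weight omega trades off direct tokenizer supervision against the similarity-refined targets.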
Pages: 231-246
Page count: 16