GMML IS ALL YOU NEED

Cited by: 4
Authors
Atito, Sara [1,2]
Awais, Muhammed [1,2]
Nandam, Srinivasa [2]
Kittler, Josef [1]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
[2] Univ Surrey, Surrey Inst People Centred AI, Guildford GU2 7XH, Surrey, England
Source
2023 IEEE International Conference on Image Processing (ICIP), 2023
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Self-supervised Learning; Vision Transformers; Group Masked Model Learning; Deep Learning
DOI
10.1109/ICIP49359.2023.10222150
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision transformers (ViTs) have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data-hungry and are therefore often pretrained on large-scale datasets, e.g. JFT-300M or ImageNet. An ideal learning method would perform well regardless of dataset size, a property that current learning methods lack, and only a few existing works study ViTs with limited data. We propose Group Masked Model Learning (GMML), a self-supervised learning (SSL) method that is able to train ViTs and achieve state-of-the-art (SOTA) performance when pretrained with limited data. GMML uses the information conveyed by all concepts in the image. This is achieved by randomly manipulating groups of connected tokens, successively covering different meaningful parts of the image content, and then recovering the hidden information from the visible parts of the concept. Unlike most existing SSL approaches, GMML requires neither a momentum encoder nor careful implementation details such as large batches and gradient stopping. Pretraining, finetuning, and evaluation codes are available at: https://github.com/GMML.
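The group masking the abstract describes — repeatedly hiding connected blocks of patch tokens until a target fraction of the image is covered, so the model must recover each hidden region from its visible surroundings — can be sketched as follows. This is an illustrative reconstruction under assumed parameters (grid size, block-size range, mask ratio), not the authors' implementation; the function name `group_mask` and its defaults are hypothetical.

```python
import random

def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, seed=0):
    """Mask random connected rectangular groups of patch tokens until
    roughly `mask_ratio` of the grid is covered.

    Returns a grid_h x grid_w boolean grid; True marks a masked token.
    """
    rng = random.Random(seed)
    masked = [[False] * grid_w for _ in range(grid_h)]
    target = int(mask_ratio * grid_h * grid_w)
    count = 0
    while count < target:
        # Sample a small rectangular block of connected tokens.
        bh = rng.randint(1, max_block)
        bw = rng.randint(1, max_block)
        top = rng.randint(0, grid_h - bh)
        left = rng.randint(0, grid_w - bw)
        # Mark the block; only newly covered tokens advance the count.
        for r in range(top, top + bh):
            for c in range(left, left + bw):
                if not masked[r][c]:
                    masked[r][c] = True
                    count += 1
    return masked

# Example: a 14x14 patch grid (e.g. a 224x224 image with 16x16 patches).
mask = group_mask(14, 14, mask_ratio=0.5)
ratio = sum(map(sum, mask)) / (14 * 14)
```

Masking contiguous blocks rather than independent tokens is what forces the network to reconstruct whole occluded regions of a concept from context; the final coverage may slightly exceed `mask_ratio` because the last block can overshoot the target.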
Pages: 2125-2129
Page count: 5
Cited References
18 records in total
[1] Assran M., Caron M., Misra I., Bojanowski P., Bordes F., Vincent P., Joulin A., Rabbat M., Ballas N. Masked Siamese Networks for Label-Efficient Learning. Computer Vision, ECCV 2022, Pt XXXI, 2022, 13691: 456-473.
[2] Bardes A. ICLR, 2022.
[3] Cao H. European Conference on Computer Vision (ECCV), 2022.
[4] Caron M., Touvron H., Misra I., Jegou H., Mairal J., Bojanowski P., Joulin A. Emerging Properties in Self-Supervised Vision Transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 9630-9640.
[5] Chen X., Poveda J. I., Li N. Safe Model-Free Optimal Voltage Control via Continuous-Time Zeroth-Order Methods. 2021 60th IEEE Conference on Decision and Control (CDC), 2021: 4064-4070.
[6] Deng J. Proc. CVPR IEEE, 2009: 248. DOI 10.1109/CVPRW.2009.5206848.
[7] Devlin J. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, 2019: 4171.
[8] Dosovitskiy A. ICLR, 2021: 1.
[9] Grill J.-B. Advances in Neural Information Processing Systems, 2020: 21271.
[10] He K., Chen X., Xie S., Li Y., Dollar P., Girshick R. Masked Autoencoders Are Scalable Vision Learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 15979-15988.