GMML IS ALL YOU NEED

Cited by: 4
Authors
Atito, Sara [1, 2]
Awais, Muhammed [1, 2]
Nandam, Srinivasa [2]
Kittler, Josef [1]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
[2] Univ Surrey, Surrey Inst People Centred AI, Guildford GU2 7XH, Surrey, England
Source
2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Self-supervised Learning; Vision Transformers; Group Masked Model Learning; Deep Learning;
DOI
10.1109/ICIP49359.2023.10222150
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformers (ViTs) have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether sharply confined and local or long-range and global. However, they are known to be data hungry and are therefore often pretrained on large-scale datasets, e.g. JFT-300M or ImageNet. An ideal learning method would perform best regardless of the size of the dataset, a property that current learning methods lack, and only a few existing works study ViTs with limited data. We propose Group Masked Model Learning (GMML), a self-supervised learning (SSL) method that can train ViTs and achieve state-of-the-art (SOTA) performance when pretrained with limited data. GMML uses the information conveyed by all concepts in the image. This is achieved by randomly manipulating groups of connected tokens, successively covering different meaningful parts of the image content, and then recovering the hidden information from the visible part of the concept. Unlike most existing SSL approaches, GMML requires neither a momentum encoder nor careful implementation details such as large batches and gradient stopping. Pretraining, finetuning, and evaluation code is available at: https://github.com/GMML.
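The group-masking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`group_mask`, `corrupt_tokens`, `masked_l1`), the rectangular block shape, the Gaussian-noise corruption, and the L1 recovery loss are all assumptions made for illustration, since the abstract does not specify them. The sketch assumes `max_block` is no larger than either grid dimension and that `mask_ratio` is below 1.

```python
import numpy as np


def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, seed=0):
    """Return a boolean (grid_h, grid_w) mask; True marks a manipulated token.

    Hypothetical helper: rectangular blocks of connected tokens are sampled
    until at least `mask_ratio` of the token grid is covered.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # Sample the size and position of one block of connected tokens.
        bh = int(rng.integers(1, max_block + 1))
        bw = int(rng.integers(1, max_block + 1))
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask


def corrupt_tokens(tokens, mask, seed=0):
    """Replace masked tokens with Gaussian noise (an assumed corruption).

    `tokens` has shape (grid_h * grid_w, dim); visible tokens are unchanged.
    """
    rng = np.random.default_rng(seed)
    flat = mask.reshape(-1)
    corrupted = tokens.copy()
    corrupted[flat] = rng.normal(size=(int(flat.sum()), tokens.shape[1]))
    return corrupted


def masked_l1(pred, target, mask):
    """Mean absolute error over the manipulated tokens only (assumed loss)."""
    flat = mask.reshape(-1)
    return float(np.abs(pred[flat] - target[flat]).mean())


# Usage: a 14 x 14 token grid, as a ViT with 16 x 16 patches gives on 224 x 224 input.
mask = group_mask(14, 14, mask_ratio=0.5)
tokens = np.ones((14 * 14, 8))
corrupted = corrupt_tokens(tokens, mask)
loss = masked_l1(corrupted, tokens, mask)
```

In a training loop, a model would receive `corrupted` and be asked to predict `tokens`; computing the loss only on masked positions reflects the abstract's description of recovering hidden information from the visible context.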
Pages: 2125-2129
Page count: 5