SLIP: Self-supervision Meets Language-Image Pre-training

Cited by: 147
Authors
Mu, Norman [1 ]
Kirillov, Alexander [2 ]
Wagner, David [1 ]
Xie, Saining [2 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Meta AI, Menlo Pk, CA USA
Source
COMPUTER VISION, ECCV 2022, PT XXVI | 2022 / Vol. 13686
Keywords
DOI
10.1007/978-3-031-19809-0_30
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at: github.com/facebookresearch/SLIP.
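The abstract describes SLIP as a multi-task objective that adds a self-supervised loss on augmented image views to the CLIP image-text contrastive loss. The sketch below illustrates that combination in PyTorch; the choice of a SimCLR-style (InfoNCE) loss for the self-supervised branch, the function names, and the ssl_scale weighting are illustrative assumptions rather than details taken from this record.

import torch
import torch.nn.functional as F

def clip_loss(image_embed, text_embed, temperature=0.07):
    # Symmetric InfoNCE between L2-normalized image and text embeddings.
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    logits = image_embed @ text_embed.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def simclr_loss(z1, z2, temperature=0.1):
    # InfoNCE between two augmented views of the same batch of images (assumed SSL branch).
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    n = z1.size(0)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))  # a view is never its own positive
    # Positives: view i pairs with view i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_loss(image_embed, text_embed, view1_embed, view2_embed, ssl_scale=1.0):
    # Multi-task objective: language supervision plus self-supervision.
    return clip_loss(image_embed, text_embed) + ssl_scale * simclr_loss(view1_embed, view2_embed)

In this sketch the image encoder would produce image_embed from the captioned image and view1_embed/view2_embed from two random augmentations of it, while a separate text encoder produces text_embed; the two losses are summed for a single backward pass.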
Pages: 529-544
Page count: 16