Combined scaling for zero-shot transfer learning

Cited by: 23
Authors
Pham, Hieu [1]
Dai, Zihang [1]
Ghiasi, Golnaz [1]
Kawaguchi, Kenji [2]
Liu, Hanxiao [1]
Yu, Adams Wei [1]
Yu, Jiahui [1]
Chen, Yi-Ting [1]
Luong, Minh-Thang [1]
Wu, Yonghui [1]
Tan, Mingxing [1]
Le, Quoc V. [1]
Affiliations
[1] Google Research, Brain Team, Mountain View, CA, USA
[2] Harvard University, Cambridge, MA 02138, USA
Keywords
Deep learning; Computer vision; Deep neural networks; Zero-shot transfer; INFORMED NEURAL-NETWORKS; MODELS
DOI
10.1016/j.neucom.2023.126658
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent developments in multimodal training methodologies, including CLIP and ALIGN, obviate the necessity for individual data labeling. These approaches utilize pairs of images and corresponding textual descriptions found online as a form of weak supervision signal. However, models employing this kind of weak supervision are not as competitive as their supervised and semi-supervised counterparts when sufficient labeled data is accessible. This performance gap constrains the applicability of weakly supervised models. In this paper, we narrow the gap by proposing a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, namely ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% average top-1 accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we first develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as CLIP and ALIGN. Based on this theoretical result, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions (data size, model size, and batch size) by proposing a new method that uses gradient checkpointing and model parallelism. As a result, our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's.
Pages: 23
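
To make the scaling recipe in the abstract concrete, below is a minimal, self-contained sketch of a CLIP/ALIGN-style symmetric contrastive step with gradient checkpointing enabled inside each encoder tower. The toy `Tower` module, its layer sizes, and the temperature value are illustrative assumptions for a single device; the paper's actual implementation combines checkpointing with model parallelism at a 3B-parameter, 65536-batch scale, which this sketch does not attempt to reproduce.

```python
# Hedged sketch of a contrastive image-text training step; all module
# names and sizes are hypothetical, not the authors' implementation.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


class Tower(torch.nn.Module):
    """Toy encoder standing in for the image or text tower."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, embed_dim),
        )

    def forward(self, x):
        # Gradient checkpointing: activations inside self.net are recomputed
        # during the backward pass instead of being stored, freeing memory
        # so the contrastive batch size can be pushed higher.
        return checkpoint(self.net, x, use_reentrant=False)


def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0))         # matches lie on the diagonal
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2


if __name__ == "__main__":
    image_tower = Tower(in_dim=512)  # stands in for a vision encoder
    text_tower = Tower(in_dim=768)   # stands in for a text encoder
    images = torch.randn(32, 512)    # fake image features, batch size 32
    texts = torch.randn(32, 768)     # fake text features
    loss = contrastive_loss(image_tower(images), text_tower(texts))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

Checkpointing discards intermediate activations in the forward pass and recomputes them during backward, trading roughly one extra forward pass of compute for the memory needed to raise the contrastive batch size, the quantity the paper's theory ties to the generalization gap.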