Combined scaling for zero-shot transfer learning

Cited by: 22
Authors
Pham, Hieu [1]
Dai, Zihang [1]
Ghiasi, Golnaz [1]
Kawaguchi, Kenji [2]
Liu, Hanxiao [1]
Yu, Adams Wei [1]
Yu, Jiahui [1]
Chen, Yi-Ting [1]
Luong, Minh-Thang [1]
Wu, Yonghui [1]
Tan, Mingxing [1]
Le, Quoc V. [1]
Affiliations
[1] Brain Team, Google Research, Mountain View, CA, USA
[2] Harvard University, Cambridge, MA 02138, USA
Keywords
Deep learning; Computer vision; Deep neural networks; Zero-shot transfer; Informed neural networks; Models
DOI
10.1016/j.neucom.2023.126658
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent developments in multimodal training methodologies, including CLIP and ALIGN, obviate the necessity for individual data labeling. These approaches utilize image-text pairs found online as a weak supervision signal. However, models trained with this kind of weak supervision are not as competitive as their supervised and semi-supervised counterparts when sufficient labeled data is available. This performance gap constrains the applicability of weakly supervised models. In this paper, we narrow the gap by proposing a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published comparable models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on five test sets with natural distribution shifts, ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% average top-1 accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we first develop a theoretical framework showing that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as CLIP and ALIGN. Based on this result, we scale up the contrastive learning framework of CLIP and ALIGN along three dimensions (data size, model size, and batch size), using gradient checkpointing and model parallelism to make the larger models and batches fit in memory. As a result, our dataset has 6.6B noisy image-text pairs, 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, 3.75x more parameters and 8x more FLOPs than ALIGN and CLIP. Finally, our batch size is 65,536, 2x larger than CLIP's and 4x larger than ALIGN's.
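To make the training objective concrete: the contrastive framework scaled up here is the symmetric image-text loss popularized by CLIP and ALIGN. Below is a minimal PyTorch sketch of that loss and of the zero-shot classification step; the function names, the temperature value, and the use of precomputed encoder features are illustrative assumptions, not the actual BASIC implementation. The sketch also makes visible why batch size matters in the theoretical result above: every other example in the batch serves as a negative.

```python
# Minimal sketch (not the BASIC implementation) of a symmetric
# CLIP/ALIGN-style contrastive loss plus the zero-shot classification step.
# Function names and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Row i of image_feats is paired with row i of text_feats; every other
    row in the batch acts as a negative, so larger batches give more
    negatives per example.
    """
    # L2-normalize so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def zero_shot_classify(image_feats: torch.Tensor,
                       class_text_feats: torch.Tensor) -> torch.Tensor:
    """Zero-shot prediction: pick the class whose prompt embedding
    (e.g. from 'a photo of a {class}') is closest to the image embedding."""
    image_feats = F.normalize(image_feats, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    return (image_feats @ class_text_feats.t()).argmax(dim=-1)
```

The memory-side techniques the abstract names are likewise standard: gradient checkpointing (e.g. torch.utils.checkpoint in PyTorch) recomputes activations during the backward pass instead of storing them, trading extra compute for the memory needed to reach batch sizes like 65,536 when combined with model parallelism across accelerators.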
Pages: 23
Related papers
50 items in total
  • [31] Zero-Shot Program Representation Learning
    Cui, Nan
    Jiang, Yuze
    Gu, Xiaodong
    Shen, Beijun
    30th IEEE/ACM International Conference on Program Comprehension (ICPC 2022), 2022: 60-70
  • [32] Research progress of zero-shot learning
    Sun, Xiaohong
    Gu, Jinan
    Sun, Hongying
    Applied Intelligence, 2021, 51(6): 3600-3614
  • [34] Joint Dictionaries for Zero-Shot Learning
    Kolouri, Soheil
    Rostami, Mohammad
    Owechko, Yuri
    Kim, Kyungnam
    Thirty-Second AAAI Conference on Artificial Intelligence / Thirtieth Innovative Applications of Artificial Intelligence Conference / Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 2018: 3431-3439
  • [35] Creativity Inspired Zero-Shot Learning
    Elhoseiny, Mohamed
    Elfeki, Mohamed
    2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 5783-5792
  • [36] Synthesized Classifiers for Zero-Shot Learning
    Changpinyo, Soravit
    Chao, Wei-Lun
    Gong, Boqing
    Sha, Fei
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 5327-5336
  • [37] Zero-Shot Learning With Transferred Samples
    Guo, Yuchen
    Ding, Guiguang
    Han, Jungong
    Gao, Yue
    IEEE Transactions on Image Processing, 2017, 26(7): 3277-3290
  • [38] LVQ Treatment for Zero-Shot Learning
    Ismailoglu, Firat
    Turkish Journal of Electrical Engineering and Computer Sciences, 2023, 31(1): 216-237
  • [39] Attribute subspaces for zero-shot learning
    Zhou, Lei
    Liu, Yang
    Bai, Xiao
    Li, Na
    Yu, Xiaohan
    Zhou, Jun
    Hancock, Edwin R.
    Pattern Recognition, 2023, 144
  • [40] A review on multimodal zero-shot learning
    Cao, Weipeng
    Wu, Yuhao
    Sun, Yixuan
    Zhang, Haigang
    Ren, Jin
    Gu, Dujuan
    Wang, Xingkai
    Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2023, 13(2)