Combined scaling for zero-shot transfer learning

Cited: 22
Authors
Pham, Hieu [1 ]
Dai, Zihang [1 ]
Ghiasi, Golnaz [1 ]
Kawaguchi, Kenji [2 ]
Liu, Hanxiao [1 ]
Yu, Adams Wei [1 ]
Yu, Jiahui [1 ]
Chen, Yi-Ting [1 ]
Luong, Minh-Thang [1 ]
Wu, Yonghui [1 ]
Tan, Mingxing [1 ]
Le, Quoc V. [1]
Affiliations
[1] Brain Team, Google Res, Mountain View, CA USA
[2] Harvard Univ, Cambridge, MA 02138 USA
Keywords
Deep learning; Computer vision; Deep neural networks; Zero-shot transfer; INFORMED NEURAL-NETWORKS; MODELS;
DOI
10.1016/j.neucom.2023.126658
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent developments in multimodal training methodologies, including CLIP and ALIGN, obviate the necessity for individual data labeling. These approaches utilize pairs of data and corresponding textual information found online as a form of weak supervision signal. However, models employing this kind of weak supervision are not as competitive as their supervised and semi-supervised counterparts when sufficient labeled data is accessible. This performance gap constrains the applicability of weakly supervised models. In this paper, we narrow the gap by proposing a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we first develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as CLIP and ALIGN. Based on this theoretical result, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions (data size, model size, and batch size) by proposing a new method using gradient checkpointing and model parallelism. As a result, our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x more than CLIP and 4x more than ALIGN.
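The objective the abstract scales up is the CLIP/ALIGN-style symmetric contrastive (InfoNCE) loss over matched image-text pairs, where every other pair in the batch serves as a negative — which is why the batch size appears in the paper's generalization bound. A minimal NumPy sketch of that loss is below; the function name and the temperature value 0.07 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Every other row in the batch acts as an in-batch negative, so a larger
    batch supplies more negatives per pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(logits.shape[0])            # matched pairs on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

A batch of perfectly matched embeddings yields a much lower loss than a batch of unrelated ones, since the diagonal logits dominate each row.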
Pages: 23
Related papers
50 in total
  • [21] Bangla Sign alphabet recognition with zero-shot and transfer learning
    Nihal, Ragib Amin
    Rahman, Sejuti
    Broti, Nawara Mahmood
    Deowan, Shamim Ahmed
    PATTERN RECOGNITION LETTERS, 2021, 150 : 84 - 93
  • [22] Zero-Shot Cross-Lingual Transfer with Meta Learning
    Nooralahzadeh, Farhad
    Bekoulis, Giannis
    Bjerva, Johannes
    Augenstein, Isabelle
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 4547 - 4562
  • [23] Pseudo Transfer with Marginalized Corrupted Attribute for Zero-shot Learning
    Long, Teng
    Xu, Xing
    Li, Youyou
    Shen, Fumin
    Song, Jingkuan
    Shen, Heng Tao
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 1802 - 1810
  • [24] A Unified Approach for Conventional Zero-Shot, Generalized Zero-Shot, and Few-Shot Learning
    Rahman, Shafin
    Khan, Salman
    Porikli, Fatih
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (11) : 5652 - 5667
  • [25] Dual-focus transfer network for zero-shot learning
    Jia, Zhen
    Zhang, Zhang
    Shan, Caifeng
    Wang, Liang
    Tan, Tieniu
    NEUROCOMPUTING, 2023, 541
  • [26] Zero-Shot Transfer Learning Based on Visual and Textual Resemblance
    Yang, Gang
    Xu, Jieping
    NEURAL INFORMATION PROCESSING (ICONIP 2019), PT III, 2019, 11955 : 353 - 362
  • [27] Scaling Human-Object Interaction Recognition through Zero-Shot Learning
    Shen, Liyue
    Yeung, Serena
    Hoffman, Judy
    Mori, Greg
Fei-Fei, Li
    2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 1568 - 1576
  • [28] Learning semantic ambiguities for zero-shot learning
    Hanouti, Celina
    Le Borgne, Herve
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (26) : 40745 - 40759
  • [30] Practical Aspects of Zero-Shot Learning
    Saad, Elie
    Paprzycki, Marcin
    Ganzha, Maria
    COMPUTATIONAL SCIENCE, ICCS 2022, PT II, 2022, : 88 - 95