Combined scaling for zero-shot transfer learning

Cited by: 23
|
Authors
Pham, Hieu [1]
Dai, Zihang [1]
Ghiasi, Golnaz [1]
Kawaguchi, Kenji [2]
Liu, Hanxiao [1]
Yu, Adams Wei [1]
Yu, Jiahui [1]
Chen, Yi-Ting [1]
Luong, Minh-Thang [1]
Wu, Yonghui [1]
Tan, Mingxing [1]
Le, Quoc V. [1]
Affiliations
[1] Google Research, Brain Team, Mountain View, CA, USA
[2] Harvard University, Cambridge, MA 02138, USA
Keywords
Deep learning; Computer vision; Deep neural networks; Zero-shot transfer; Informed neural networks; Models
DOI
10.1016/j.neucom.2023.126658
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent developments in multimodal training methodologies, such as CLIP and ALIGN, remove the need for individually labeled data. These approaches use image-text pairs found online as a weak supervision signal. However, models trained with this kind of weak supervision are not as competitive as their supervised and semi-supervised counterparts when sufficient labeled data is available. This performance gap limits the applicability of weakly supervised models. In this paper, we narrow the gap by proposing a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on five test sets with natural distribution shifts, ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% average top-1 accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we first develop a theoretical framework showing that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as CLIP and ALIGN. Based on this theoretical result, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions (data size, model size, and batch size) by proposing a new method that uses gradient checkpointing and model parallelism. As a result, our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's.
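The scaling recipe summarized in the abstract builds on the standard CLIP/ALIGN contrastive objective: image and text embeddings from the same pair are pulled together while every other in-batch pairing serves as a negative, and gradient checkpointing is used so that larger contrastive batches fit in accelerator memory. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation; the toy encoders, embedding dimension, and tiny batch size are illustrative assumptions.

```python
# Minimal sketch of a CLIP/ALIGN-style contrastive image-text objective with
# gradient checkpointing. Encoders, dimensions, and batch size are placeholders,
# not the architectures or scales used by BASIC.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


class ToyImageTextModel(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Placeholder encoders; the paper scales up far larger image and text towers.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, embed_dim))
        self.text_encoder = nn.Linear(77, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP

    def forward(self, images, texts):
        # Gradient checkpointing trades recomputation for memory, one of the tricks
        # that lets the contrastive batch size grow (a key scaling axis in the paper).
        img_emb = checkpoint(self.image_encoder, images, use_reentrant=False)
        txt_emb = checkpoint(self.text_encoder, texts, use_reentrant=False)
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # Pairwise similarity logits; diagonal entries are the matching image-text pairs.
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        labels = torch.arange(images.size(0), device=images.device)
        # Symmetric cross-entropy: each image must retrieve its caption and vice versa.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


if __name__ == "__main__":
    model = ToyImageTextModel()
    images = torch.randn(8, 3, 224, 224)  # toy batch of 8; BASIC uses 65536
    texts = torch.randn(8, 77)
    loss = model(images, texts)
    loss.backward()
    print(float(loss))
```

In the full-scale setting, the in-batch negatives come from all devices, so the effective batch size (65536 here) directly determines how many negatives each pair sees, which is where the paper's batch-size generalization argument applies.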
Pages: 23
Related papers
50 records in total
  • [11] Transfer and zero-shot learning for scalable weed detection and classification in UAV images
    Belissent, Nicolas
    Pena, Jose M.
    Mesias-Ruiz, Gustavo A.
    Shawe-Taylor, John
    Perez-Ortiz, Maria
    KNOWLEDGE-BASED SYSTEMS, 2024, 292
  • [12] A Review of Generalized Zero-Shot Learning Methods
    Pourpanah, Farhad
    Abdar, Moloud
    Luo, Yuxuan
    Zhou, Xinlei
    Wang, Ran
    Lim, Chee Peng
    Wang, Xi-Zhao
    Wu, Q. M. Jonathan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (04) : 4051 - 4070
  • [13] Convolutional prototype learning for zero-shot recognition
    Liu, Zhizhe
    Zhang, Xingxing
    Zhu, Zhenfeng
    Zheng, Shuai
    Zhao, Yao
    Cheng, Jian
    IMAGE AND VISION COMPUTING, 2020, 98
  • [14] Content-Attribute Disentanglement for Generalized Zero-Shot Learning
    An, Yoojin
    Kim, Sangyeon
    Liang, Yuxuan
    Zimmermann, Roger
    Kim, Dongho
    Kim, Jihie
    IEEE ACCESS, 2022, 10 : 58320 - 58331
  • [15] Towards Zero-Shot Frame Semantic Parsing for Domain Scaling
    Bapna, Ankur
    Tur, Gokhan
    Hakkani-Tur, Dilek
    Heck, Larry
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2476 - 2480
  • [16] An Introduction to Zero-Shot Learning: An Essential Review
    Soysal, Omurhan A.
    Guzel, Mehmet Serdar
    2ND INTERNATIONAL CONGRESS ON HUMAN-COMPUTER INTERACTION, OPTIMIZATION AND ROBOTIC APPLICATIONS (HORA 2020), 2020, : 510 - 513
  • [17] LEARNING VISUALLY CONSISTENT LABEL EMBEDDINGS FOR ZERO-SHOT LEARNING
    Demirel, Berkan
    Cinbis, Ramazan Gokberk
    Ikizler-Cinbis, Nazli
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3656 - 3660
  • [18] Zero-Shot Learning and Classification of Steel Surface Defects
    Nagy, Amr M.
    Czuni, Laszlo
    FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084
  • [19] A Contrastive Method for Continual Generalized Zero-Shot Learning
    Liang, Chen
    Fan, Wentao
    Liu, Xin
    Peng, Shu-Juan
    ADVANCES AND TRENDS IN ARTIFICIAL INTELLIGENCE. THEORY AND APPLICATIONS, IEA/AIE 2023, PT I, 2023, 13925 : 365 - 376
  • [20] Unmasking the Masked Face Using Zero-Shot Learning
    Singh, Pranjali
    Singh, Amritpal
    ADVANCED NETWORK TECHNOLOGIES AND INTELLIGENT COMPUTING, ANTIC 2021, 2022, 1534 : 563 - 585