Combined scaling for zero-shot transfer learning

Cited by: 23
Authors
Pham, Hieu [1 ]
Dai, Zihang [1 ]
Ghiasi, Golnaz [1 ]
Kawaguchi, Kenji [2 ]
Liu, Hanxiao [1 ]
Yu, Adams Wei [1 ]
Yu, Jiahui [1 ]
Chen, Yi-Ting [1 ]
Luong, Minh-Thang [1 ]
Wu, Yonghui [1 ]
Tan, Mingxing [1 ]
Le, Quoc V. [1 ]
Affiliations
[1] Brain Team, Google Res, Mountain View, CA USA
[2] Harvard Univ, Cambridge, MA 02138 USA
Keywords
Deep learning; Computer vision; Deep neural networks; Zero-shot transfer; INFORMED NEURAL-NETWORKS; MODELS;
DOI
10.1016/j.neucom.2023.126658
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent developments in multimodal training methodologies, including CLIP and ALIGN, obviate the necessity for individual data labeling. These approaches utilize pairs of data and corresponding textual information found online as a form of weak supervision signal. However, models employing this kind of weak supervision are not as competitive as their supervised and semi-supervised counterparts when sufficient labeled data is accessible. This performance gap constrains the applicability of weakly supervised models. In this paper, we narrow the gap by proposing a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we first develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as CLIP and ALIGN. Based on this theoretical result, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions (data size, model size, and batch size) by proposing a new method using gradient checkpointing and model parallelism. As a result, our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's.
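The batch-size scaling the abstract emphasizes matters because in CLIP/ALIGN-style contrastive training every other pair in the batch acts as a negative, so larger batches give each example more negatives. The sketch below is a minimal, hedged illustration of this symmetric image-text contrastive (InfoNCE) objective in NumPy; it is not the authors' implementation, and the function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over one batch.

    Each of the n pairs is contrasted against the other n-1 in-batch
    negatives, so a larger batch means more negatives per example --
    the quantity the paper's theory links to the generalization gap.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # logits[i, j] = sim(image_i, text_j) / T; matching pairs on the diagonal.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2
```

At the 65536-pair batch size reported above, the n-by-n logits matrix is what makes naive training memory-bound, which is why the paper resorts to gradient checkpointing and model parallelism.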
Pages: 23
Related papers
50 records in total
  • [31] Visual Structure Constraint for Transductive Zero-Shot Learning in the Wild
    Wan, Ziyu
    Chen, Dongdong
    Liao, Jing
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129 (06) : 1893 - 1909
  • [33] Research Progress of Zero-Shot Learning Beyond Computer Vision
    Cao, Weipeng
    Zhou, Cong
    Wu, Yuhao
    Ming, Zhong
    Xu, Zhiwu
    Zhang, Jiyong
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2020, PT II, 2020, 12453 : 538 - 551
  • [34] Recursive Training for Zero-Shot Semantic Segmentation
    Wang, Ce
    Farazi, Moshiur
    Barnes, Nick
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [35] Zero-Shot Defect Feature Optimizer: an efficient zero-shot optimization method for defect detection
    Yan, Zhibo
    Wu, Hanyang
    Aasim, Tehreem
    Yao, Haitao
    Zhang, Teng
    Wang, Dongyun
    JOURNAL OF ELECTRONIC IMAGING, 2025, 34 (01)
  • [36] Learning Latent Semantic Attributes for Zero-Shot Object Detection
    Wang, Kang
    Zhang, Lu
    Tan, Yifan
    Zhao, Jiajia
    Zhou, Shuigeng
    2020 IEEE 32ND INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2020, : 230 - 237
  • [37] Dual VAEGAN: A generative model for generalized zero-shot learning
    Luo, Yuxuan
    Wang, Xizhao
    Pourpanah, Farhad
    APPLIED SOFT COMPUTING, 2021, 107
  • [38] Ricci Planner: Zero-Shot Transfer for Goal-Conditioned Reinforcement Learning via Geometric Flow
    Song, Wongeun
    Lee, Jungwoo
    IEEE ACCESS, 2024, 12 : 24027 - 24038
  • [39] An Adversarial Learning Framework for Zero-shot Fault Recognition of Mechanical Systems
    Chen, Jinglong
    Pan, Tongyang
    Zhou, Zitong
    He, Shuilong
    2019 IEEE 17TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2019, : 1275 - 1278
  • [40] Attribute-Based Zero-Shot Learning for Encrypted Traffic Classification
    Hu, Ying
    Cheng, Guang
    Chen, Wenchao
    Jiang, Bomiao
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2022, 19 (04): : 4583 - 4599