DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

Cited by: 9
Authors
Chen, Yihao [1 ]
Qi, Xianbiao [1 ]
Wang, Jianan [1 ]
Zhang, Lei [1 ]
Affiliations
[1] International Digital Economy Academy (IDEA), Shenzhen, Guangdong, People's Republic of China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
Keywords
DOI
10.1109/CVPR52729.2023.02169
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of the contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all-reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from O(B^2) to O(B^2/N), where B and N are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, whereas the original CLIP solution requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K.
Pages: 22648-22657
Page count: 10
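
The memory saving described in the abstract comes from each GPU materializing only the local (B/N) x B block of the similarity matrix and exchanging gradients for the gathered features via all-reduce. The following PyTorch sketch illustrates this idea under stated assumptions; it is not the authors' released implementation, names such as GatherWithGrad and local_block_clip_loss are illustrative, and details such as gradient scaling under DistributedDataParallel are omitted. It assumes a torch.distributed process group is already initialized and that every GPU holds the image and text features of its local mini-batch.

import torch
import torch.distributed as dist
import torch.nn.functional as F


class GatherWithGrad(torch.autograd.Function):
    # All-gather features in the forward pass; in the backward pass, sum the
    # gradients computed on every GPU via all-reduce and return the slice
    # belonging to this GPU's local batch (the inter-GPU gradient collection).
    @staticmethod
    def forward(ctx, local_feat):
        world = dist.get_world_size()
        gathered = [torch.zeros_like(local_feat) for _ in range(world)]
        dist.all_gather(gathered, local_feat)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.contiguous().clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        rank, world = dist.get_rank(), dist.get_world_size()
        local_bs = grad.shape[0] // world
        return grad[rank * local_bs:(rank + 1) * local_bs]


def local_block_clip_loss(img_feat, txt_feat, logit_scale):
    # Contrastive loss in which each GPU materializes only its (B/N) x B block
    # of the logits instead of the full B x B similarity matrix.
    rank = dist.get_rank()
    local_bs = img_feat.shape[0]

    all_img = GatherWithGrad.apply(img_feat)   # (B, d)
    all_txt = GatherWithGrad.apply(txt_feat)   # (B, d)

    logits_i2t = logit_scale * img_feat @ all_txt.t()   # (B/N, B)
    logits_t2i = logit_scale * txt_feat @ all_img.t()   # (B/N, B)

    # Positions of the local positives inside the global batch.
    labels = torch.arange(local_bs, device=img_feat.device) + rank * local_bs
    return 0.5 * (F.cross_entropy(logits_i2t, labels) +
                  F.cross_entropy(logits_t2i, labels))

In a typical setup, each rank would compute img_feat and txt_feat for its local mini-batch with the (DDP-wrapped) image and text encoders and then call local_block_clip_loss, so only the local rows of the logits ever reside in GPU memory.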