CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel

Cited by: 2
Authors
Li, Zhenxing [1 ]
Cao, Qiang [1 ]
Chen, Yajie [2 ]
Yan, Wenrui [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023
Keywords
Parallel Computing; Deep Learning; Heterogeneous System
DOI
10.1145/3605573.3605647
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
The parameters of deep learning (DL) models have ballooned from millions to trillions over the past decade and can no longer fit entirely in limited GPU memory. Existing works offload the parameter-update stage of DL training to the CPU, leveraging CPU compute and memory capacity to train large-scale models. However, the stage running on the CPU can block the subsequent stage running on the GPU, wasting expensive GPU cycles. We first analyze the dataflow and workflow of DL training and find that the backward stage and the parameter-update stage can be parallelized on the GPU and CPU, respectively. To this end, we present CoTrain, a DL-training scheduling framework that allocates compute tasks and their corresponding data to the GPU and CPU and parallelizes them effectively in both a coarse-grained and a fine-grained way. In particular, the fine-grained task-partition scheme allocates a portion of the parameter-update stage to the GPU according to data-reuse distance, largely avoiding idleness on both the GPU and CPU while reducing data movement between them. We build and evaluate CoTrain atop PyTorch on representative models. The results show that, compared to the state-of-the-art ZeRO-Offload, CoTrain achieves a 30.4% improvement in training throughput while increasing the supported model size by up to 7%, without changing the training semantics.
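For context, a minimal sketch of the overlap idea described in the abstract, assuming a recent PyTorch; this is not the authors' implementation and does not show CoTrain's fine-grained partitioning by data-reuse distance. Because backward proceeds from the last layer to the first, a per-parameter hook can ship each finished gradient to the CPU and let a worker thread update that parameter's CPU master copy while the GPU continues back-propagating through earlier layers. The class name OverlappedCPUSGD and the plain-SGD update rule are illustrative assumptions.

# Illustrative sketch: overlap GPU backward with CPU-side parameter updates.
# OverlappedCPUSGD is a hypothetical name, not part of CoTrain or PyTorch.
import torch
import torch.nn as nn
from concurrent.futures import ThreadPoolExecutor


class OverlappedCPUSGD:
    def __init__(self, model: nn.Module, lr: float = 0.01):
        self.lr = lr
        self.pool = ThreadPoolExecutor(max_workers=2)   # CPU-side update workers
        self.futures = []
        self.params = list(model.parameters())
        # Master copy of every parameter lives in CPU memory.
        self.cpu_params = {p: p.detach().cpu().clone() for p in self.params}
        for p in self.params:
            # Fires during backward() as soon as p's gradient is produced.
            p.register_hook(lambda grad, p=p: self._on_grad_ready(p, grad))

    def _on_grad_ready(self, p, grad):
        grad_cpu = grad.detach().cpu()                  # move the gradient off the GPU
        self.futures.append(self.pool.submit(self._cpu_update, p, grad_cpu))
        return grad                                     # leave autograd accumulation untouched

    def _cpu_update(self, p, grad_cpu):
        master = self.cpu_params[p]
        master.add_(grad_cpu, alpha=-self.lr)           # plain SGD step on the CPU master copy
        return p, master

    def finish_step(self):
        # Wait for outstanding CPU updates, then refresh the GPU weights.
        for fut in self.futures:
            p, master = fut.result()
            p.data.copy_(master)
        self.futures.clear()
        for p in self.params:
            p.grad = None                               # drop gradients for the next step


# One training step: hooks fire inside backward(), so CPU updates of the
# already-finished (later) layers overlap with GPU backward work on earlier layers.
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
    opt = OverlappedCPUSGD(model, lr=0.01)
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.finish_step()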
Pages: 92-101
Page count: 10