CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel

Cited by: 2
Authors
Li, Zhenxing [1 ]
Cao, Qiang [1 ]
Chen, Yajie [2 ]
Yan, Wenrui [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023
Keywords
Parallel Computing; Deep Learning; Heterogeneous System
DOI
10.1145/3605573.3605647
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
The parameter counts of deep learning (DL) models have ballooned from millions to trillions over the past decade and thus can no longer fit entirely within limited GPU memory. Existing works offload the parameter-update stage of DL training to the CPU, leveraging CPU compute capability and memory to support training large-scale models. However, the stage running on the CPU can block the subsequent stage running on the GPU, wasting expensive GPU cycles. We first analyze the dataflow and workflow of DL training and find that the backward stage and the parameter-update stage can be parallelized on the GPU and CPU, respectively. To this end, we present a DL-training scheduling framework, CoTrain, which allocates compute tasks and their corresponding data to the GPU and CPU and parallelizes them effectively at both coarse and fine granularity. In particular, the fine-grained task-partition scheme allocates a portion of the parameter-update stage to the GPU according to data-reuse distance, largely avoiding idleness of both devices while reducing data movement between them. We build and evaluate CoTrain atop PyTorch on representative models. The results show that, compared to the state-of-the-art ZeRO-Offload, CoTrain improves training throughput by 30.4% and supports up to 7% larger models, without changing the training semantics.
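The core idea in the abstract, running the GPU backward stage concurrently with the CPU parameter-update stage, can be illustrated with a short PyTorch sketch. This is not the authors' implementation: it assumes a plain SGD update, per-parameter backward hooks, and a thread pool for the CPU side; names such as `cpu_master` and `apply_cpu_sgd` are illustrative, and CoTrain's fine-grained, reuse-distance-based partitioning is omitted.

```python
# Minimal sketch of coarse-grained GPU/CPU overlap: the GPU runs backward while a
# CPU thread pool applies the (assumed SGD) update to a CPU master copy of each
# parameter as soon as that parameter's gradient is produced.
import torch
import torch.nn as nn
from concurrent.futures import ThreadPoolExecutor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)

# CPU-resident master copies of the parameters (hypothetical layout, not CoTrain's).
cpu_master = [p.detach().cpu().clone() for p in model.parameters()]
pool = ThreadPoolExecutor(max_workers=4)
lr = 0.01

def apply_cpu_sgd(idx, grad_gpu):
    # Runs on a CPU worker thread, overlapping with the ongoing GPU backward pass.
    grad_cpu = grad_gpu.to("cpu")
    cpu_master[idx].add_(grad_cpu, alpha=-lr)

def make_hook(idx):
    def hook(grad):
        # Fire the CPU update as soon as this parameter's gradient is ready,
        # while the GPU keeps back-propagating through earlier layers.
        pool.submit(apply_cpu_sgd, idx, grad.detach())
        return grad
    return hook

for i, p in enumerate(model.parameters()):
    p.register_hook(make_hook(i))

x = torch.randn(32, 1024, device=device)
loss = model(x).sum()
loss.backward()                      # GPU backward overlaps with CPU updates

pool.shutdown(wait=True)             # wait for all CPU-side updates to finish
with torch.no_grad():                # copy updated masters back to the GPU model
    for p, m in zip(model.parameters(), cpu_master):
        p.copy_(m.to(device))
```

The point of the sketch is the overlap itself: each parameter's CPU update starts as soon as its gradient is emitted, while the GPU continues the backward pass for earlier layers, mirroring the coarse-grained parallelism the abstract describes.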
Pages: 92-101 (10 pages)
References (32 total; entries [21]-[30] shown below)
  • [21] Paszke A., 2019, Advances in Neural Information Processing Systems, Vol. 32.
  • [22] Peng X., Shi X., Dai H., Jin H., Ma W., Xiong Q., Yang F., Qian X. Capuchin: Tensor-based GPU Memory Management for Deep Learning. In: Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XXV), 2020: 891-905.
  • [23] Peters M. E., 2018, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, p. 2227.
  • [24] Rajbhandari S., Ruwase O., Rasley J., Smith S., He Y. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. In: SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  • [25] Rajbhandari S., Rasley J., Ruwase O., He Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In: Proceedings of SC20: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  • [26] Rasley J., Rajbhandari S., Ruwase O., He Y. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In: KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020: 3505-3506.
  • [27] Ren J., 2021, Proceedings of the 2021 USENIX Annual Technical Conference, p. 551.
  • [28] Vaswani A., 2017, Advances in Neural Information Processing Systems, Vol. 30.
  • [29] Wang L., Ye J., Zhao Y., Wu W., Li A., Song S. L., Xu Z., Kraska T. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks. ACM SIGPLAN Notices, 2018, 53(1): 41-53.
  • [30] Zhang C., Zhang F., Guo X., He B., Zhang X., Du X. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(7): 1740-1752.