CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel

Cited by: 2
Authors
Li, Zhenxing [1 ]
Cao, Qiang [1 ]
Chen, Yajie [2 ]
Yan, Wenrui [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023
Keywords
Parallel Computing; Deep Learning; Heterogeneous System
DOI
10.1145/3605573.3605647
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
The parameter counts of deep learning (DL) models have ballooned from millions to trillions over the past decade and thus can no longer fit entirely within limited GPU memory. Existing works offload the parameter-update stage of DL training to the CPU, leveraging CPU compute capability and memory to support training large-scale models. However, the stage running on the CPU can block the subsequent stage running on the GPU, wasting expensive GPU cycles. We first analyze the dataflow and workflow of DL training and find that the backward stage and the parameter-update stage can be parallelized on the GPU and CPU, respectively. To this end, we present a DL-training scheduling framework, CoTrain, which allocates compute tasks and their corresponding data to the GPU and CPU and parallelizes them effectively at both coarse and fine granularity. In particular, the fine-grained task-partition scheme allocates a portion of the parameter-update stage to the GPU according to data-reuse distance, largely avoiding idleness of both devices while reducing data movement between them. We build and evaluate CoTrain atop PyTorch on representative models. The results show that, compared to the state-of-the-art ZeRO-Offload, CoTrain improves training throughput by 30.4% and supports up to 7% larger models, without changing the training semantics.
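The core idea in the abstract, running the GPU backward stage concurrently with the CPU parameter-update stage, can be illustrated with a short PyTorch sketch. This is not the authors' implementation: it assumes a plain SGD update, per-parameter backward hooks, and a thread pool for the CPU side; names such as `cpu_master` and `apply_cpu_sgd` are illustrative, and CoTrain's fine-grained, reuse-distance-based partitioning is omitted.

```python
# Minimal sketch of coarse-grained GPU/CPU overlap: the GPU runs backward while a
# CPU thread pool applies the (assumed SGD) update to a CPU master copy of each
# parameter as soon as that parameter's gradient is produced.
import torch
import torch.nn as nn
from concurrent.futures import ThreadPoolExecutor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)

# CPU-resident master copies of the parameters (hypothetical layout, not CoTrain's).
cpu_master = [p.detach().cpu().clone() for p in model.parameters()]
pool = ThreadPoolExecutor(max_workers=4)
lr = 0.01

def apply_cpu_sgd(idx, grad_gpu):
    # Runs on a CPU worker thread, overlapping with the ongoing GPU backward pass.
    grad_cpu = grad_gpu.to("cpu")
    cpu_master[idx].add_(grad_cpu, alpha=-lr)

def make_hook(idx):
    def hook(grad):
        # Fire the CPU update as soon as this parameter's gradient is ready,
        # while the GPU keeps back-propagating through earlier layers.
        pool.submit(apply_cpu_sgd, idx, grad.detach())
        return grad
    return hook

for i, p in enumerate(model.parameters()):
    p.register_hook(make_hook(i))

x = torch.randn(32, 1024, device=device)
loss = model(x).sum()
loss.backward()                      # GPU backward overlaps with CPU updates

pool.shutdown(wait=True)             # wait for all CPU-side updates to finish
with torch.no_grad():                # copy updated masters back to the GPU model
    for p, m in zip(model.parameters(), cpu_master):
        p.copy_(m.to(device))
```

The point of the sketch is the overlap itself: each parameter's CPU update starts as soon as its gradient is emitted, while the GPU continues the backward pass for earlier layers, mirroring the coarse-grained parallelism the abstract describes.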
Pages: 92-101 (10 pages)
References (32 total; entries [21]-[30] shown below)
  • [21] Paszke A., 2019, Advances in Neural Information Processing Systems, Vol. 32.
  • [22] Peng X., Shi X., Dai H., Jin H., Ma W., Xiong Q., Yang F., Qian X. Capuchin: Tensor-based GPU Memory Management for Deep Learning. In: Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XXV), 2020: 891-905.
  • [23] Peters M. E., 2018, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, p. 2227.
  • [24] Rajbhandari S., Ruwase O., Rasley J., Smith S., He Y. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. In: SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  • [25] Rajbhandari S., Rasley J., Ruwase O., He Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In: Proceedings of SC20: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  • [26] Rasley J., Rajbhandari S., Ruwase O., He Y. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In: KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020: 3505-3506.
  • [27] Ren J., 2021, Proceedings of the 2021 USENIX Annual Technical Conference, p. 551.
  • [28] Vaswani A., 2017, Advances in Neural Information Processing Systems, Vol. 30.
  • [29] Wang L., Ye J., Zhao Y., Wu W., Li A., Song S. L., Xu Z., Kraska T. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks. ACM SIGPLAN Notices, 2018, 53(1): 41-53.
  • [30] Zhang C., Zhang F., Guo X., He B., Zhang X., Du X. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(7): 1740-1752.