CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel

Cited by: 2
Authors
Li, Zhenxing [1 ]
Cao, Qiang [1 ]
Chen, Yajie [2 ]
Yan, Wenrui [3 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023
Keywords
Parallel Computing; Deep Learning; Heterogeneous System
DOI
10.1145/3605573.3605647
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
The parameters of deep learning (DL) models have ballooned from millions to trillions over the past decade and can no longer fit entirely in limited GPU memory. Existing works offload the parameter-update stage of DL training to the CPU, leveraging CPU compute and memory capacity to train large-scale models. However, the stage running on the CPU can block the subsequent stage running on the GPU, wasting expensive GPU cycles. We first analyze the dataflow and workflow of DL training and find that the backward stage and the parameter-update stage can be parallelized on the GPU and CPU, respectively. To this end, we present CoTrain, a DL-training scheduling framework that allocates compute tasks and their corresponding data to the GPU and CPU and parallelizes them effectively in both a coarse-grained and a fine-grained way. In particular, the fine-grained task-partition scheme allocates a portion of the parameter-update stage to the GPU according to data-reuse distance, largely avoiding idleness on both the GPU and CPU while reducing data movement between them. We build and evaluate CoTrain atop PyTorch on representative models. The results show that, compared to the state-of-the-art ZeRO-Offload, CoTrain achieves a 30.4% improvement in training throughput while increasing the supported model size by up to 7%, without changing the training semantics.
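For context, a minimal sketch of the overlap idea described in the abstract, assuming a recent PyTorch; this is not the authors' implementation and does not show CoTrain's fine-grained partitioning by data-reuse distance. Because backward proceeds from the last layer to the first, a per-parameter hook can ship each finished gradient to the CPU and let a worker thread update that parameter's CPU master copy while the GPU continues back-propagating through earlier layers. The class name OverlappedCPUSGD and the plain-SGD update rule are illustrative assumptions.

# Illustrative sketch: overlap GPU backward with CPU-side parameter updates.
# OverlappedCPUSGD is a hypothetical name, not part of CoTrain or PyTorch.
import torch
import torch.nn as nn
from concurrent.futures import ThreadPoolExecutor


class OverlappedCPUSGD:
    def __init__(self, model: nn.Module, lr: float = 0.01):
        self.lr = lr
        self.pool = ThreadPoolExecutor(max_workers=2)   # CPU-side update workers
        self.futures = []
        self.params = list(model.parameters())
        # Master copy of every parameter lives in CPU memory.
        self.cpu_params = {p: p.detach().cpu().clone() for p in self.params}
        for p in self.params:
            # Fires during backward() as soon as p's gradient is produced.
            p.register_hook(lambda grad, p=p: self._on_grad_ready(p, grad))

    def _on_grad_ready(self, p, grad):
        grad_cpu = grad.detach().cpu()                  # move the gradient off the GPU
        self.futures.append(self.pool.submit(self._cpu_update, p, grad_cpu))
        return grad                                     # leave autograd accumulation untouched

    def _cpu_update(self, p, grad_cpu):
        master = self.cpu_params[p]
        master.add_(grad_cpu, alpha=-self.lr)           # plain SGD step on the CPU master copy
        return p, master

    def finish_step(self):
        # Wait for outstanding CPU updates, then refresh the GPU weights.
        for fut in self.futures:
            p, master = fut.result()
            p.data.copy_(master)
        self.futures.clear()
        for p in self.params:
            p.grad = None                               # drop gradients for the next step


# One training step: hooks fire inside backward(), so CPU updates of the
# already-finished (later) layers overlap with GPU backward work on earlier layers.
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
    opt = OverlappedCPUSGD(model, lr=0.01)
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.finish_step()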
Pages: 92-101
Page count: 10