A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Cited by: 0

Authors:

Jiang, Yimin [1,2]

Zhu, Yibo [2]

Lan, Chang [3]

Yi, Bairen [2]

Cui, Yong [1]

Guo, Chuanxiong [2]

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] ByteDance, Beijing, Peoples R China

[3] Google, Mountain View, CA USA

Source:

Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20) | 2020

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification codes:

081202 ; 0835 ;

Abstract:

Data center clusters that run DNN training jobs are inherently heterogeneous. They have GPUs and CPUs for computation and network bandwidth for distributed training. However, existing distributed DNN training architectures, all-reduce and Parameter Server (PS), cannot fully utilize such heterogeneous resources. In this paper, we present a new distributed DNN training architecture called BytePS. BytePS can leverage spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks running on GPUs. It provides a communication framework that is both proved optimal and unified - existing all-reduce and PS become two special cases of BytePS. To achieve the proved optimality in practice, BytePS further splits the functionalities of a parameter optimizer. It introduces a Summation Service abstraction for aggregating gradients, which is common for all the optimizers. Summation Service can be accelerated by AVX instructions and can be efficiently run on CPUs, while DNN model-related optimizer algorithms are run on GPUs for computation acceleration. BytePS can accelerate DNN training for major frameworks including TensorFlow, PyTorch and MXNet. For representative DNN training jobs with up to 256 GPUs, BytePS outperforms the state-of-the-art open source all-reduce and PS by up to 84% and 245%, respectively.
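The split described in the abstract, where an optimizer-agnostic summation step is separated from the model-specific update, can be illustrated with a minimal sketch. This is not the BytePS API: the function names `summation_service` and `sgd_update`, the use of plain Python lists, and the choice of SGD as the optimizer are all assumptions made for illustration only.

```python
from typing import List


def summation_service(worker_grads: List[List[float]]) -> List[float]:
    """Hypothetical CPU-side step: element-wise sum of all workers' gradients.

    It needs no knowledge of the optimizer, which is why it is common to all of them.
    """
    return [sum(values) for values in zip(*worker_grads)]


def sgd_update(params: List[float], summed_grad: List[float],
               num_workers: int, lr: float) -> List[float]:
    """Hypothetical worker-side step: plain SGD on the averaged gradient.

    In the abstract's design this optimizer-specific part stays on the GPU.
    """
    return [p - lr * g / num_workers for p, g in zip(params, summed_grad)]


# Two workers, each holding a gradient for the same two parameters.
worker_grads = [[1.0, 2.0], [3.0, 4.0]]
summed = summation_service(worker_grads)
params = sgd_update([0.0, 0.0], summed, num_workers=2, lr=0.5)
print(summed, params)
```

Swapping `sgd_update` for another optimizer would leave `summation_service` unchanged, which is the property the abstract relies on.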


Pages: 463-479

Page count: 17
