A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Cited by: 0
Authors
Jiang, Yimin [1,2]
Zhu, Yibo [2]
Lan, Chang [3]
Yi, Bairen [2]
Cui, Yong [1]
Guo, Chuanxiong [2]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] ByteDance, Beijing, Peoples R China
[3] Google, Mountain View, CA USA
Source
PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20) | 2020
Keywords
DOI
Not available
CLC Number
TP31 [Computer Software];
Discipline Classification Code
081202; 0835;
Abstract
Data center clusters that run DNN training jobs are inherently heterogeneous: they have GPUs and CPUs for computation and network bandwidth for distributed training. However, the existing distributed DNN training architectures, all-reduce and Parameter Server (PS), cannot fully utilize such heterogeneous resources. In this paper, we present a new distributed DNN training architecture called BytePS. BytePS leverages spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks running on GPUs. It provides a communication framework that is both provably optimal and unified: existing all-reduce and PS become two special cases of BytePS. To achieve this proven optimality in practice, BytePS further splits the functionality of the parameter optimizer. It introduces a Summation Service abstraction for aggregating gradients, which is common to all optimizers. The Summation Service can be accelerated by AVX instructions and runs efficiently on CPUs, while the DNN model-specific optimizer algorithms run on GPUs for computation acceleration. BytePS accelerates DNN training for major frameworks including TensorFlow, PyTorch, and MXNet. For representative DNN training jobs with up to 256 GPUs, BytePS outperforms the state-of-the-art open-source all-reduce and PS implementations by up to 84% and 245%, respectively.
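To make the Summation Service split concrete, here is a minimal Python sketch (not the BytePS API; the names SummationService and sgd_update are hypothetical, and network communication is elided). CPU-side servers do nothing but sum gradient tensors, an optimizer-agnostic operation that vectorizes well (e.g., with AVX), while each GPU worker pulls the sum and applies the model-specific optimizer step locally; plain SGD stands in for that step below.

# Minimal sketch of the Summation Service split (not the BytePS implementation).
import numpy as np

class SummationService:
    """Runs on a CPU machine; its only job is to add gradient tensors."""
    def __init__(self, partition_size):
        self.accum = np.zeros(partition_size, dtype=np.float32)

    def push(self, grad_partition):
        # Element-wise add; NumPy dispatches to SIMD (e.g., AVX) kernels on the CPU.
        self.accum += grad_partition

    def pull(self):
        # Hand back the aggregated gradient and reset for the next iteration.
        summed, self.accum = self.accum, np.zeros_like(self.accum)
        return summed

def sgd_update(params, summed_grad, lr=0.1, num_workers=4):
    # Runs on each GPU worker: the model-specific optimizer step stays with the model.
    return params - lr * (summed_grad / num_workers)

if __name__ == "__main__":
    dim, num_workers = 8, 4
    service = SummationService(dim)
    params = np.ones(dim, dtype=np.float32)
    # Each worker pushes its local gradient partition to the CPU service ...
    for _ in range(num_workers):
        service.push(np.full(dim, 0.5, dtype=np.float32))
    # ... then pulls the sum and applies the optimizer update locally.
    params = sgd_update(params, service.pull(), num_workers=num_workers)
    print(params)

Keeping only summation on the CPU side is what lets the same service serve any optimizer: the stateful, model-specific update (momentum, Adam, etc.) never leaves the GPU workers.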
Pages: 463-479
Number of pages: 17