A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Cited by: 0
Authors
Jiang, Yimin [1,2]
Zhu, Yibo [2]
Lan, Chang [3]
Yi, Bairen [2]
Cui, Yong [1]
Guo, Chuanxiong [2]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] ByteDance, Beijing, Peoples R China
[3] Google, Mountain View, CA USA
Source
PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20), 2020
Keywords
DOI
Not available
CLC number
TP31 [Computer Software];
Discipline codes
081202; 0835
Abstract
Data center clusters that run DNN training jobs are inherently heterogeneous: they have GPUs and CPUs for computation and network bandwidth for distributed training. However, existing distributed DNN training architectures, all-reduce and Parameter Server (PS), cannot fully utilize such heterogeneous resources. In this paper, we present a new distributed DNN training architecture called BytePS. BytePS can leverage spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks running on GPUs. It provides a communication framework that is both provably optimal and unified: existing all-reduce and PS become two special cases of BytePS. To achieve this provable optimality in practice, BytePS further splits the functionality of a parameter optimizer. It introduces a Summation Service abstraction for aggregating gradients, which is common to all optimizers. Summation Service can be accelerated by AVX instructions and runs efficiently on CPUs, while the DNN model-related optimizer algorithms run on GPUs for computation acceleration. BytePS can accelerate DNN training for major frameworks including TensorFlow, PyTorch, and MXNet. For representative DNN training jobs with up to 256 GPUs, BytePS outperforms the state-of-the-art open-source all-reduce and PS implementations by up to 84% and 245%, respectively.
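The Summation Service split described in the abstract lends itself to a small illustration. Below is a minimal sketch, assuming PyTorch; the names summation_service and optimizer_update are hypothetical and this is not BytePS's actual API. It shows the optimizer being divided into an optimizer-agnostic summation step (placed on CPU, where plain elementwise addition vectorizes well, e.g. with AVX) and a model-specific update step that stays with the parameters (on GPU in a real deployment).

import torch

def summation_service(gradient_shards):
    # CPU side: sum gradient tensors arriving from the GPU workers.
    # Pure elementwise addition, so it is optimizer-agnostic and
    # vectorizes well on CPUs.
    total = torch.zeros_like(gradient_shards[0])
    for shard in gradient_shards:
        total += shard
    return total

def optimizer_update(param, summed_grad, lr=0.01):
    # GPU side: apply the model-specific optimizer step (plain SGD here).
    with torch.no_grad():
        param -= lr * summed_grad.to(param.device)

# Toy run: three "workers" each contribute a gradient for one parameter.
param = torch.ones(4)                                      # on GPU in practice
grads = [torch.full((4,), float(i)) for i in range(1, 4)]  # per-worker grads
summed = summation_service([g.cpu() for g in grads])       # CPU aggregation
optimizer_update(param, summed)                            # parameter update
print(param)  # tensor([0.9400, 0.9400, 0.9400, 0.9400])

The design intuition the sketch captures: summation is associative, commutative, and stateless, so it can run anywhere spare CPU cycles exist, whereas optimizer algorithms such as Adam carry per-parameter state and benefit from GPU acceleration.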
Pages: 463-479
Page count: 17
Related Papers
50 in total
  • [1] EDDIS: Accelerating Distributed Data-Parallel DNN Training for Heterogeneous GPU Cluster
    Ahn, Shinyoung
    Ahn, Hooyoung
    Choi, Hyeonseong
    Lee, Jaehyun
    2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW 2024, 2024, : 1167 - 1168
  • [2] OpenCL as a Unified Programming Model for Heterogeneous CPU/GPU Clusters
    Kim, Jungwon
    Seo, Sangmin
    Lee, Jun
    Nah, Jeongho
    Jo, Gangwon
    Lee, Jaejin
    ACM SIGPLAN NOTICES, 2012, 47 (08) : 299 - 300
  • [3] Benchmarking of High Performance Computing Clusters with Heterogeneous CPU/GPU Architecture
    Sukharev, Pavel V.
    Vasilyev, Nikolay P.
    Rovnyagin, Mikhail M.
    Durnov, Maxim A.
    PROCEEDINGS OF THE 2017 IEEE RUSSIA SECTION YOUNG RESEARCHERS IN ELECTRICAL AND ELECTRONIC ENGINEERING CONFERENCE (2017 ELCONRUS), 2017, : 574 - 577
  • [4] Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs
    Jain, Arpan
    Alnaasan, Nawras
    Shafi, Aamir
    Subramoni, Hari
    Panda, Dhabaleswar K.
    2021 IEEE SYMPOSIUM ON HIGH-PERFORMANCE INTERCONNECTS (HOTI 2021), 2021, : 17 - 24
  • [5] PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters
    Zhang, Jinghui
    Niu, Geng
    Dai, Qiangsheng
    Li, Haorui
    Wu, Zhihua
    Dong, Fang
    Wu, Zhiang
    NEUROCOMPUTING, 2023, 555
  • [6] Accelerating MapReduce on a Coupled CPU-GPU Architecture
    Chen, Linchuan
    Huo, Xin
    Agrawal, Gagan
    2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2012,
  • [7] A Unified CPU-GPU Protocol for GNN Training
    Lin, Yi-Chien
    Deng, Gangda
    Prasanna, Viktor
    PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2024, CF 2024, 2024, : 155 - 163
  • [8] GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data
    Chen, Cen
    Li, Kenli
    Ouyang, Aijia
    Zeng, Zeng
    Li, Keqin
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (06) : 1275 - 1288
  • [9] GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data
    Chen, Cen
    Li, Kenli
    Ouyang, Aijia
    Tang, Zhuo
    Li, Keqin
    PROCEEDINGS 45TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - ICPP 2016, 2016, : 542 - 551
  • [10] Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures
    Marques, Jose
    Falcao, Gabriel
    Alexandre, Luis A.
    APPLIED ARTIFICIAL INTELLIGENCE, 2018, 32 (9-10) : 822 - 844