A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

Cited by: 0
Authors
Jiang, Yimin [1 ,2 ]
Zhu, Yibo [2 ]
Lan, Chang [3 ]
Yi, Bairen [2 ]
Cui, Yong [1 ]
Guo, Chuanxiong [2 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] ByteDance, Beijing, Peoples R China
[3] Google, Mountain View, CA USA
Source
PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20) | 2020
Keywords
DOI
Not available
CLC Classification
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
Data center clusters that run DNN training jobs are inherently heterogeneous: they have GPUs and CPUs for computation and network bandwidth for distributed training. However, the existing distributed DNN training architectures, all-reduce and Parameter Server (PS), cannot fully utilize such heterogeneous resources. In this paper, we present a new distributed DNN training architecture called BytePS. BytePS can leverage spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks running on GPUs. It provides a communication framework that is both provably optimal and unified: existing all-reduce and PS become two special cases of BytePS. To achieve this provable optimality in practice, BytePS further splits the functionalities of a parameter optimizer. It introduces a Summation Service abstraction for aggregating gradients, which is common to all optimizers. Summation Service can be accelerated by AVX instructions and run efficiently on CPUs, while the DNN model-related optimizer algorithms run on GPUs for computation acceleration. BytePS can accelerate DNN training for major frameworks including TensorFlow, PyTorch, and MXNet. For representative DNN training jobs with up to 256 GPUs, BytePS outperforms the state-of-the-art open-source all-reduce and PS implementations by up to 84% and 245%, respectively.
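The abstract's key idea — splitting the parameter optimizer so that only stateless gradient summation runs on CPUs while the model-specific update runs on GPU workers — can be illustrated with a minimal NumPy sketch. The function names and the plain-SGD update below are illustrative assumptions, not BytePS's actual API; the point is only that the CPU-side step needs no optimizer state, just an element-wise sum.

```python
import numpy as np

def summation_service(partial_grads):
    # Summation Service: a stateless element-wise sum of worker gradients.
    # This is the only optimizer functionality BytePS places on CPUs,
    # where it can be accelerated with AVX instructions.
    return np.sum(partial_grads, axis=0)

def optimizer_update(param, grad, lr=0.1):
    # The model-related optimizer step (plain SGD here for illustration)
    # stays on the GPU workers in BytePS's design.
    return param - lr * grad

# Two hypothetical workers each produce a partial gradient.
param = np.ones(4)
worker_grads = [np.full(4, 0.5), np.full(4, 1.5)]

total = summation_service(worker_grads)   # element-wise: 0.5 + 1.5 = 2.0
param = optimizer_update(param, total)    # 1.0 - 0.1 * 2.0 = 0.8
```

Because the summation is identical for all optimizers, the CPU side never needs to know which update rule (SGD, Adam, etc.) the GPU workers apply.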
页码:463 / 479
页数:17