US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

Cited by: 2
Authors
Gao, Yunqi [1 ]
Hu, Bing [1 ]
Mashhadi, Mahdi Boloursaz [2 ]
Jin, A-Long [3 ]
Xiao, Pei [2 ]
Wu, Chunming [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou 310027, Peoples R China
[2] Univ Surrey, Inst Commun Syst ICS, 5GIC & 6GIC, Guildford GU2 7XH, England
[3] Univ Hong Kong, Pokfulam, Hong Kong, Peoples R China
Keywords
Tensors; Training; Parallel processing; Deep learning; Computer architecture; Backpropagation; Scheduling; Communication scheduling; data parallelism; distributed deep learning; tensor fusion; tensor partitioning
DOI
10.1109/TPDS.2023.3331372
Chinese Library Classification
TP301 [Theory and Methods]
Discipline code
081202
Abstract
The communication bottleneck severely constrains the scalability of distributed deep learning, and efficient communication scheduling accelerates distributed DNN training by overlapping computation and communication tasks. However, existing approaches based on tensor partitioning are inefficient and suffer from two challenges: 1) a fixed number of tensor blocks transferred in parallel cannot always minimize the communication overhead; 2) although a scheduling order that preferentially transmits tensor blocks close to the input layer lets forward propagation in the next iteration start earlier, it does not achieve the shortest per-iteration time. In this paper, we propose an efficient communication framework called US-Byte, which schedules unequal-sized tensor blocks in a near-optimal order to minimize the training time. We build the mathematical model of US-Byte in two phases: 1) the overlap of gradient communication with backward propagation, and 2) the overlap of gradient communication with forward propagation. We theoretically derive the optimal solution for the second phase and efficiently solve the first phase with a low-complexity algorithm. We implement the US-Byte architecture on the PyTorch framework. Extensive experiments on two different 8-node GPU clusters demonstrate that US-Byte achieves up to 1.26x and 1.56x speedups over ByteScheduler and WFBP, respectively. We further run simulations with 128 GPUs to verify the potential scaling performance of US-Byte; the results show that US-Byte achieves up to 1.69x speedup over the state-of-the-art communication framework.
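The computation/communication overlap that the abstract describes can be illustrated with a toy cost model (a minimal sketch, not the paper's algorithm: the function names, timings, and the perfect-pipelining assumption are all hypothetical):

```python
# Illustrative cost model for overlapping gradient communication with
# backward propagation, given unequal-sized tensor blocks. This is a
# conceptual sketch, not US-Byte's actual scheduling algorithm.

def overlapped_iteration_time(backward_times, block_sizes, bandwidth):
    """Layer i's backward step takes backward_times[i] seconds and emits a
    gradient block of block_sizes[i] bytes; a block may start transmitting
    as soon as its layer's backward step finishes and the link is free."""
    t_compute = 0.0  # time at which the current block becomes ready
    t_comm = 0.0     # time at which the link becomes free
    for bt, size in zip(backward_times, block_sizes):
        t_compute += bt                     # block ready after this layer's backprop
        start = max(t_compute, t_comm)      # wait for the link if it is busy
        t_comm = start + size / bandwidth   # transmission of this block finishes
    return max(t_compute, t_comm)

def sequential_iteration_time(backward_times, block_sizes, bandwidth):
    """Baseline with no overlap: communicate only after all backprop ends."""
    return sum(backward_times) + sum(block_sizes) / bandwidth
```

With three layers of 1 s backprop each and three 2-byte blocks on a 1 B/s link, the sequential baseline takes 9 s while the overlapped schedule takes 7 s, since early blocks transmit while later layers are still computing. Block sizes and ordering determine how much of the communication is hidden, which is the degree of freedom US-Byte optimizes.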
Pages: 123 - 139
Page count: 17