A Dynamic Sliding Window Based Tensor Communication Scheduling Framework for Distributed Deep Learning

Cited by: 0
Authors
Gao, Yunqi [1 ]
Hu, Bing [1 ]
Mashhadi, Mahdi Boloursaz [2 ]
Wang, Wei [1 ]
Tafazolli, Rahim [2 ]
Debbah, Merouane [3 ]
Affiliations
[1] Zhejiang Univ, Hangzhou 310027, Peoples R China
[2] Univ Surrey, 5GIC & 6G, Inst Commun Syst ICS, Guildford GU2 7XH, England
[3] Khalifa Univ, KU 6G Res Ctr, Dept Comp & Informat Engn, Abu Dhabi 127788, U Arab Emirates
Source
IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING | 2025, Vol. 12, No. 2
Keywords
Tensors; Training; Processor scheduling; Parallel processing; Dynamic scheduling; Computational modeling; Artificial neural networks; Computer architecture; Mathematical models; Energy consumption; Distributed deep learning; data parallelism; communication scheduling; tensor partitioning; generative pre-trained transformer (GPT); EFFICIENT;
DOI
10.1109/TNSE.2024.3523320
CLC Number
T [Industrial Technology];
Discipline Code
08;
Abstract
Simultaneous tensor communication can effectively improve the scalability of distributed deep learning on large clusters. However, communicating a fixed number of tensor blocks concurrently violates the priority-based scheduling strategy and cannot minimize communication overhead. In this paper, we propose a novel simultaneous tensor communication framework, D-Credit, which transmits tensor blocks based on dynamic sliding windows to minimize per-iteration time in distributed DNN training. We build the mathematical model of D-Credit in two phases: (1) the overlap of gradient communication with backward propagation, and (2) the overlap of gradient communication with forward computation. We derive the optimal window sizes for the second phase analytically, and develop a greedy algorithm to efficiently determine the dynamic window sizes for the first phase of D-Credit. We implement the D-Credit architecture on the PyTorch framework. Experimental results on two different GPU clusters show that, in terms of training speed, D-Credit achieves up to 1.26x, 1.21x, 1.48x and 1.53x speedup over ByteScheduler, DeAR, PyTorch-DDP and WFBP, respectively. In terms of energy consumption, D-Credit saves up to 17.8% and 25.1% of the training energy compared to ByteScheduler and WFBP, respectively.
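To make the sliding-window idea in the abstract concrete, below is a minimal, illustrative Python sketch of credit-style scheduling over prioritized tensor blocks: at most `window` blocks are in flight at once, the highest-priority (earliest-needed) blocks are dispatched first, and the window is adapted as transfers complete. The class names, the fixed block size, and the toy per-phase window-update rule are assumptions for illustration only; they are not the D-Credit implementation or its derived optimal window sizes.

# Illustrative sketch only; all names and the window-update rule are assumptions.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Block:
    priority: int                      # lower value = needed earlier in the forward pass
    tensor_id: int = field(compare=False)
    size: int = field(compare=False)   # bytes

class SlidingWindowScheduler:
    def __init__(self, init_window: int = 4):
        self.window = init_window      # "credit": max blocks in flight concurrently
        self.ready = []                # priority queue of blocks awaiting transfer
        self.in_flight = 0

    def enqueue(self, block: Block) -> None:
        heapq.heappush(self.ready, block)

    def dispatch(self):
        # Send blocks while credit remains, highest priority first.
        sent = []
        while self.ready and self.in_flight < self.window:
            blk = heapq.heappop(self.ready)
            self.in_flight += 1
            sent.append(blk)           # hand off to the communication backend
        return sent

    def on_complete(self, phase: str) -> None:
        # Called when a block finishes transfer; adapt the window per phase.
        # Toy rule (assumption): widen the window while communication overlaps
        # backward propagation, shrink it when overlapping forward computation
        # so high-priority (front-layer) blocks are not delayed.
        self.in_flight -= 1
        self.window = self.window + 1 if phase == "backward" else max(1, self.window - 1)

if __name__ == "__main__":
    sched = SlidingWindowScheduler(init_window=2)
    for tid, prio in enumerate([5, 1, 3, 2, 4]):   # gradients become ready back-to-front
        sched.enqueue(Block(priority=prio, tensor_id=tid, size=1 << 20))
    print([b.tensor_id for b in sched.dispatch()])  # only `window` blocks go out
    sched.on_complete(phase="backward")             # credit freed, window widened
    print([b.tensor_id for b in sched.dispatch()])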
Pages: 1080-1095
Page count: 16