Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Cited by: 0
Authors
Liu, Kaiyang [1 ]
Wang, Jingrong [2 ]
Huang, Zhiming [3 ]
Pan, Jianping [3 ]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John's, NL A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness;
DOI
10.1109/TPDS.2024.3390109
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low latency decision-making and improve scheduling fairness, we propose to construct the sparsification of feasible solution categories through sampling, which has negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
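The proportional training workload assignment the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual scheme: the function name, the largest-remainder rounding, and the example throughputs are all assumptions introduced here. The idea is that splitting a job's per-iteration batch in proportion to each worker's throughput equalizes per-worker iteration time, so no single slow worker stalls the job.

```python
# Hedged sketch of proportional workload assignment across heterogeneous
# workers. All names and the rounding rule are illustrative assumptions.

def proportional_assignment(total_samples, throughputs):
    """Split a batch of total_samples in proportion to each worker's
    throughput (samples/s), so all workers finish an iteration at
    roughly the same time."""
    total_tp = sum(throughputs)
    shares = [total_samples * tp / total_tp for tp in throughputs]
    # Round to integers while preserving the total (largest-remainder method).
    floors = [int(s) for s in shares]
    remainder = total_samples - sum(floors)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:remainder]:
        floors[i] += 1
    return floors

# Example: 3 workers with throughputs 100, 50, 50 samples/s share a
# 512-sample batch; each then needs about 2.56 s per iteration.
loads = proportional_assignment(512, [100, 50, 50])  # [256, 128, 128]
```

Under this split, per-worker iteration time is load/throughput, which is equal across workers up to rounding; a uniform split would instead leave the fastest worker idle while the slowest one finishes.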
Pages: 874-888 (15 pages)