Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs

Cited by: 16
Authors
Hu, Qinghao [1,2]
Zhang, Meng [1]
Sun, Peng [2,3]
Wen, Yonggang [4]
Zhang, Tianwei [4]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore, Singapore
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] SenseTime Res, Hong Kong, Peoples R China
[4] Nanyang Technol Univ, Singapore, Singapore
Source
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, VOL 2, ASPLOS 2023 | 2023
Keywords
Cluster Management; Workload Scheduling; Machine Learning
DOI
10.1145/3575693.3575705
CLC classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
While recent deep learning workload schedulers exhibit excellent performance, they are arduous to deploy in practice due to substantial defects, including an inflexible intrusive design, exorbitant integration and maintenance costs, limited scalability, and opaque decision processes. Motivated by these issues, we design and implement Lucid, a non-intrusive deep learning workload scheduler based on interpretable models. It consists of three innovative modules. First, a two-dimensional optimized profiler is introduced for efficient job metric collection and timely feedback for debugging jobs. Second, Lucid utilizes an indolent packing strategy to circumvent interference. Third, Lucid orchestrates resources based on estimated job priority values and sharing scores to achieve efficient scheduling. Additionally, Lucid promotes model performance maintenance and transparent system adjustment via a well-designed system optimizer. Our evaluation shows that Lucid reduces the average job completion time by up to 1.3x compared with the state-of-the-art preemptive scheduler Tiresias. Furthermore, it provides explicit system interpretations and excellent scalability for practical deployment.
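As an illustration of the priority-and-sharing-score idea the abstract sketches, the following minimal Python sketch orders pending jobs by an estimated priority value and co-locates two jobs only when both of their sharing scores clear a threshold. It is not Lucid's actual implementation; the Job fields, the schedule function, and the SHARE_THRESHOLD value are hypothetical assumptions introduced purely for illustration.

from dataclasses import dataclass
from typing import List

SHARE_THRESHOLD = 0.5  # hypothetical cutoff above which a job tolerates co-location

@dataclass
class Job:
    name: str
    priority: float       # hypothetical: estimated from profiled duration and GPU demand
    sharing_score: float  # hypothetical: estimated tolerance to interference when packed

def schedule(pending: List[Job], free_gpus: int) -> List[List[Job]]:
    """Place jobs in descending priority order; pack two jobs onto one GPU
    only when both sharing scores exceed the threshold, otherwise run alone."""
    placements: List[List[Job]] = []
    queue = sorted(pending, key=lambda j: j.priority, reverse=True)
    while queue and free_gpus > 0:
        job = queue.pop(0)
        partner = None
        if job.sharing_score > SHARE_THRESHOLD:
            # Pick the highest-priority remaining job that also tolerates sharing.
            partner = next((j for j in queue if j.sharing_score > SHARE_THRESHOLD), None)
        if partner is not None:
            queue.remove(partner)
            placements.append([job, partner])
        else:
            placements.append([job])
        free_gpus -= 1
    return placements

if __name__ == "__main__":
    jobs = [Job("resnet", 0.9, 0.7), Job("bert", 0.6, 0.2), Job("gpt", 0.8, 0.8)]
    print(schedule(jobs, free_gpus=2))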
Pages: 457-472
Number of pages: 16