Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Cited by: 0
Authors
Luo, Yizhou [1 ]
Wang, Qiang [1 ]
Shi, Shaohuai [1 ]
Lai, Jiaxin [1 ]
Qi, Shuhan [1 ]
Zhang, Jiajia [1 ]
Wang, Xuan [1 ]
Affiliations
[1] Harbin Institute of Technology, Shenzhen, Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, People's Republic of China
Source
2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 2024
Funding
National Natural Science Foundation of China
Keywords
Distributed Deep Learning; Job Scheduling; Communication Contention
DOI
10.1109/IWQoS61813.2024.10682877
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep learning (DL) has achieved significant success across diverse fields, prompting the construction of dedicated GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reducing operational costs and improving resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also incur preemption and migration overhead and risk degrading DL accuracy. Moreover, few of them explore the potential benefits of GPU sharing for improving resource utilization and reducing job queuing times. Motivated by these insights, we present a job scheduling model that allows multiple jobs to share the same set of GPUs without altering their training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and the associated runtime settings (sub-batch size and scheduling time point) to optimize overall performance while preserving DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, SJF-BSBF, despite being a preemption-free policy, reduces the average job completion time by 27-33% relative to state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF wisely determines the resource sharing settings, such as the sharing time point and the sub-batch size for gradient accumulation, outperforming an aggressive GPU sharing approach (the baseline SJF-FFS policy) by up to 17% on large-scale traces.
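The mechanism the abstract describes, pairing a queued job with a running one only when the estimated benefit of sharing outweighs queuing, and shrinking the sub-batch size while using gradient accumulation so the effective batch size (and hence convergence) is unchanged, can be sketched in a few lines. The Python sketch below is an illustration of that idea, not the authors' implementation: the contention factor, accumulation-overhead model, memory model, and all function and field names are assumptions introduced here for illustration.

    # Illustrative sketch of the SJF-BSBF idea (all constants and cost
    # models are assumptions, not the paper's measured values).
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    CONTENTION = 1.3       # assumed per-step slowdown when two jobs share GPUs
    ACCUM_OVERHEAD = 0.05  # assumed extra cost per additional accumulation step
    GPU_MEM_GB = 32.0      # assumed per-GPU memory budget

    @dataclass
    class Job:
        name: str
        remaining: float       # est. remaining runtime on dedicated GPUs (s)
        batch_size: int        # requested global batch size
        mem_per_sample: float  # GB per sample (assumed linear memory model)

    def shared_runtime(job: Job, sub_batch: int) -> float:
        """Runtime when co-located: contention slows every step, and running
        batch_size/sub_batch micro-steps per update (gradient accumulation
        keeps the effective batch size, hence convergence) adds overhead."""
        accum_steps = job.batch_size / sub_batch
        return job.remaining * CONTENTION * (1 + ACCUM_OVERHEAD * (accum_steps - 1))

    def best_share(running: Job, waiting: Job) -> Optional[Tuple[int, float]]:
        """Pick the waiting job's sub-batch size giving the earliest shared
        finish time while both jobs fit in GPU memory; None if nothing fits."""
        best = None
        sub = waiting.batch_size
        while sub >= 1:
            mem = (running.batch_size * running.mem_per_sample
                   + sub * waiting.mem_per_sample)
            if mem <= GPU_MEM_GB:
                finish = max(shared_runtime(running, running.batch_size),
                             shared_runtime(waiting, sub))
                if best is None or finish < best[1]:
                    best = (sub, finish)
            sub //= 2
        return best

    def schedule(pending: List[Job], running: List[Job], free_gpus: int) -> None:
        """SJF-BSBF skeleton: shortest jobs take free GPUs first; the rest
        share with the running job offering the best benefit, but only if
        sharing beats simply waiting in the queue."""
        pending.sort(key=lambda j: j.remaining)  # shortest job first
        for job in list(pending):
            if free_gpus > 0:                    # dedicated GPUs still available
                free_gpus -= 1
                running.append(job)
                pending.remove(job)
                continue
            if not running:
                break
            # Finish time if the job instead waits for the earliest release.
            wait_finish = min(r.remaining for r in running) + job.remaining
            # Best sharing benefit first: evaluate every running job as host.
            options = [(r, best_share(r, job)) for r in running]
            options = [(r, s) for r, s in options if s is not None]
            if options:
                host, (sub, finish) = min(options, key=lambda o: o[1][1])
                if finish < wait_finish:         # share only when it pays off
                    print(f"co-locate {job.name} with {host.name}, sub-batch={sub}")
                    pending.remove(job)

The design point this sketch mirrors is that sharing is opportunistic and preemption-free: the sub-batch search accepts a configuration only when its estimated shared finish time beats queuing, which is what distinguishes SJF-BSBF from the aggressive always-share SJF-FFS baseline mentioned in the abstract.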
Pages: 10