Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Cited by: 0
Authors
Liu, Kaiyang [1]
Wang, Jingrong [2]
Huang, Zhiming [3]
Pan, Jianping [3]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness
DOI
10.1109/TPDS.2024.3390109
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and improve scheduling fairness, we propose to construct the sparsification of feasible solution categories through sampling, which has negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
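The abstract's remark that the slowest worker limits training speed, and its mention of proportional training workload assignment, can be illustrated with a small sketch. This is not the article's algorithm: the function names `proportional_split` and `iteration_time`, the per-worker speeds in samples per second, and the largest-remainder rounding are all assumptions made for illustration only.

```python
import math


def proportional_split(total_samples, speeds):
    """Split total_samples across workers in proportion to their speeds.

    speeds: hypothetical per-worker throughput (samples/second), all > 0.
    Returns integer shares that sum to total_samples.
    """
    if total_samples < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need a non-negative total and positive speeds")
    total_speed = sum(speeds)
    # Ideal (possibly fractional) share of each worker.
    ideal = [total_samples * s / total_speed for s in speeds]
    shares = [math.floor(x) for x in ideal]
    # Hand leftover samples to the workers with the largest fractional parts.
    leftover = total_samples - sum(shares)
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def iteration_time(shares, speeds):
    """Synchronous step time: every worker waits for the slowest one."""
    return max(n / s for n, s in zip(shares, speeds))


speeds = [100, 200, 500]            # assumed heterogeneous workers
even = [267, 267, 266]              # naive equal split of 800 samples
prop = proportional_split(800, speeds)
print(prop, iteration_time(prop, speeds), iteration_time(even, speeds))
```

With these assumed speeds, the equal split is held back by the 100 samples/second worker, while the proportional split lets all three workers finish at about the same time.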
Pages: 874-888
Page count: 15