Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Cited by: 0
Authors
Liu, Kaiyang [1]
Wang, Jingrong [2]
Huang, Zhiming [3]
Pan, Jianping [3]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness;
DOI
10.1109/TPDS.2024.3390109
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler that solves the multi-job placement problem while taking into account job sizing and load balancing, with the goal of minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and to improve scheduling fairness, we propose sparsifying the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
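The abstract names proportional training workload assignment as one ingredient but gives no details. As a rough illustration only, and not the authors' algorithm, the sketch below assumes that "proportional" means splitting a job's per-iteration samples in proportion to each worker's measured throughput, so that the slowest worker no longer dictates the step time. The function names, worker labels, and throughput figures are all hypothetical.

```python
from typing import Dict


def proportional_split(total_samples: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Split total_samples across workers in proportion to throughput (samples/s).

    Assumption: throughput is a measured, positive rate per worker.
    """
    if total_samples < 0 or not throughput:
        raise ValueError("need a non-negative sample count and at least one worker")
    if any(rate <= 0 for rate in throughput.values()):
        raise ValueError("throughput values must be positive")

    total_rate = sum(throughput.values())
    # Ideal (fractional) share of each worker.
    ideal = {w: total_samples * rate / total_rate for w, rate in throughput.items()}
    # Start from the floor of each ideal share.
    shares = {w: int(ideal[w]) for w in throughput}
    # Hand out the leftover samples to the largest fractional remainders.
    leftover = total_samples - sum(shares.values())
    by_remainder = sorted(throughput, key=lambda w: ideal[w] - shares[w], reverse=True)
    for w in by_remainder[:max(leftover, 0)]:
        shares[w] += 1
    return shares


def iteration_time(shares: Dict[str, int], throughput: Dict[str, float]) -> float:
    """A synchronous training step finishes when the slowest worker finishes."""
    return max(shares[w] / throughput[w] for w in shares)


if __name__ == "__main__":
    # Hypothetical heterogeneous workers (samples per second).
    rates = {"fast": 300.0, "mid": 200.0, "slow": 100.0}
    balanced = proportional_split(600, rates)
    equal = {w: 200 for w in rates}
    print(balanced, iteration_time(balanced, rates))
    print(equal, iteration_time(equal, rates))
```

With these made-up rates, an equal split leaves the step waiting on the slow worker, while the proportional split lets all three workers finish together. The scheme in the article additionally handles job sizing, multi-job placement, and fairness, none of which this sketch attempts.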
Pages: 874 - 888
Number of pages: 15
Related papers
50 in total
  • [21] Recommended Model for Fusing Multi-Source Heterogeneous Data Based on Deep Learning
    Ji Z.-Y.
    Song X.-J.
    Pi H.-Y.
    Yang C.
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2019, 42 (06) : 35 - 42
  • [22] JPAS: Job-progress-aware flow scheduling for deep learning clusters
    Zhou, Pan
    He, Xinshu
    Luo, Shouxi
    Yu, Hongfang
    Sun, Gang
    JOURNAL OF NETWORK AND COMPUTER APPLICATIONS, 2020, 158 (158)
  • [23] Data Partitioning Strategy of GPU Heterogeneous Clusters Based on Learning
    Li, Jianjiang
    Chen, Wei
    Tian, Jin
    Zheng, Hongyan
    Zhang, Peng
    Liu, Yajun
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (09) : 403 - 418
  • [24] Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
    Zhang, Wenlu
    Li, Rongjian
    Zeng, Tao
    Sun, Qian
    Kumar, Sudhir
    Ye, Jieping
    Ji, Shuiwang
    IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 322 - 333
  • [25] Deep Learning-Based Resource Allocation Scheme for Heterogeneous NOMA Networks
    Kim, Donghyeon
    Kwon, Sean
    Jung, Haejoon
    Lee, In-Ho
    IEEE ACCESS, 2023, 11 : 89423 - 89432
  • [26] A Deep Learning-Based Approach for Foot Placement Prediction
    Lee, Sung-Wook
    Asbeck, Alan
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (08) : 4959 - 4966
  • [27] Edge service placement strategy based on distributed deep learning
    Zou H.
    Bai C.
    He P.
    Cui Y.
    Wang R.
    Wu D.
    Xi Tong Gong Cheng Yu Dian Zi Ji Shu/Systems Engineering and Electronics, 2022, 44 (05) : 1728 - 1737
  • [28] Deep learning-based microstructure analysis of multi-component heterogeneous composites during preparation
    Li, Haozhen
    Wei, Chong
    Cao, Zixiong
    Zhang, Yi
    Li, Xiaoqiang
    COMPOSITES PART A-APPLIED SCIENCE AND MANUFACTURING, 2024, 186
  • [29] Multi-View Deep Network: A Deep Model Based on Learning Features From Heterogeneous Neural Networks for Sentiment Analysis
    Sadr, Hossein
    Pedram, Mir Mohsen
    Teshnehlab, Mohammad
    IEEE ACCESS, 2020, 8 : 86984 - 86997
  • [30] A Comparison of Re-Sampling Techniques for Detection of Multi-Step Attacks on Deep Learning Models
    Jamal, Muhammad Hassan
    Naz, Naila
    Khattak, Muazzam A. Khan
    Saeed, Faisal
    Altamimi, Saad Nasser
    Qasem, Sultan Noman
    IEEE ACCESS, 2023, 11 : 127446 - 127457