Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Cited by: 0
Authors
Liu, Kaiyang [1]
Wang, Jingrong [2]
Huang, Zhiming [3]
Pan, Jianping [3]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness;
DOI
10.1109/TPDS.2024.3390109
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low latency decision-making and improve scheduling fairness, we propose to construct the sparsification of feasible solution categories through sampling, which has negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
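The abstract names proportional training workload assignment as one building block without giving details. The following is a minimal sketch of that general idea only, not the authors' algorithm: it assumes each worker has a known, fixed throughput in samples per second, and it splits one job's global batch so that all workers finish an iteration at roughly the same time. The names `proportional_split` and `iteration_time` are hypothetical and chosen for illustration.

```python
from typing import List


def proportional_split(total_batch: int, throughputs: List[float]) -> List[int]:
    """Split total_batch samples across workers in proportion to throughput.

    Assumption (not from the paper): throughput is a fixed, positive
    samples-per-second figure measured per worker.
    """
    if total_batch < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need a non-negative batch and positive throughputs")

    total = sum(throughputs)
    # Ideal fractional share for each worker.
    shares = [total_batch * t / total for t in throughputs]
    # Round down first, then hand out the leftover samples to the workers
    # with the largest fractional remainders (largest-remainder rounding).
    alloc = [int(s) for s in shares]
    leftover = total_batch - sum(alloc)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


def iteration_time(alloc: List[int], throughputs: List[float]) -> float:
    """Synchronous iteration time: the slowest worker sets the pace."""
    return max(a / t for a, t in zip(alloc, throughputs))


if __name__ == "__main__":
    speeds = [1.0, 2.0, 3.0]  # hypothetical samples/second per worker
    balanced = proportional_split(120, speeds)
    even = [40, 40, 40]
    print("balanced:", balanced, iteration_time(balanced, speeds))
    print("even:    ", even, iteration_time(even, speeds))
```

With an even split the slowest worker dominates the iteration time, which is the straggler effect the abstract describes; the proportional split removes that imbalance under the fixed-throughput assumption. Job sizing, the matching-market step, and the sampling-based sparsification are not modeled here.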
Pages: 874-888
Page count: 15
Related papers
50 in total
  • [41] JOINT OPTIMIZATION OF SAMPLING PATTERN AND PRIORS IN MODEL BASED DEEP LEARNING
    Aggarwal, Hemant K.
    Jacob, Mathews
    2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2020), 2020, : 926 - 929
  • [42] Research on food safety sampling inspection system based on deep learning
    Chen, Tzu-Chia
    Yu, Shu-Yan
    FOOD SCIENCE AND TECHNOLOGY, 2022, 42
  • [43] Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
    Chu, Ching-Hsiang
    Lu, Xiaoyi
    Awan, Ammar A.
    Subramoni, Hari
    Hashmi, Jahanzeb
    Elton, Bracy
    Panda, Dhabaleswar K.
    2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), 2017, : 161 - 170
  • [44] MULTI-PURPOSE CHESTNUT CLUSTERS DETECTION USING DEEP LEARNING: A PRELIMINARY APPROACH
    Adao, Telmo
    Padua, Luis
    Pinho, Tatiana M.
    Hruska, Jonas
    Sousa, Antonio
    Sousa, Joaquim Joao
    Morais, Raul
    Peres, Emanuel
    ISPRS ICWG III/IVA GI4DM 2019 - GEOINFORMATION FOR DISASTER MANAGEMENT, 2019, 42-3 (W8): : 1 - 7
  • [45] A Multi-Task-Learning-Based Transfer Deep Reinforcement Learning Design for Autonomic Optical Networks
    Chen, Xiaoliang
    Proietti, Roberto
    Liu, Che-Yu
    Yoo, S. J. Ben
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2021, 39 (09) : 2878 - 2889
  • [46] Multi-proxy based deep metric learning
    Chan, Patrick P. K.
    Li, Shute
    Deng, Jingwen
    Yeung, Daniel S.
    INFORMATION SCIENCES, 2023, 643
  • [47] Multi-objective recognition based on deep learning
    Liu, Xin
    Wu, Junhui
    Man, Yiyun
    Xu, Xibao
    Guo, Jifeng
    AIRCRAFT ENGINEERING AND AEROSPACE TECHNOLOGY, 2020, 92 (08) : 1185 - 1193
  • [48] DLPF: A parallel deep learning programming framework based on heterogeneous architecture
    Wang Y.
    Dou Y.
    Lü Q.
    Li B.
    Li T.
    1600, Science Press (53): : 1202 - 1210
  • [49] Deep Learning based Intelligent Recognition Method in Heterogeneous Communication Networks
    Gu, Hao
    Wang, Yu
    Hong, Sheng
    Xu, Yongjun
    Gui, Guan
    2020 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC), 2020, : 478 - 482
  • [50] HDQGF:Heterogeneous Data Quality Guarantee Framework Based on Deep Learning
    Zhang, Yun
    Jin, Zongze
    Zhu, Weilin
    Chi, Lei
    Wang, Weiping
    PROCEEDINGS OF THE 2021 IEEE 24TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2021, : 901 - 906