Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters

Cited by: 0
Authors
Liu, Kaiyang [1]
Wang, Jingrong [2]
Huang, Zhiming [3]
Pan, Jianping [3]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF A1B 3X5, Canada
[2] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5S 3G4, Canada
[3] Univ Victoria, Dept Comp Sci, Victoria, BC V8P 5C2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Training; Deep learning; Load management; Processor scheduling; Computational modeling; Throughput; Parallel processing; Distributed deep learning; job placement; job sizing; load balancing; heterogeneity-aware scheduling; fairness;
DOI
10.1109/TPDS.2024.3390109
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler that solves the multi-job placement problem while accounting for job sizing and load balancing, with the goal of minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and to improve scheduling fairness, we sparsify the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme achieves 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
Pages: 874-888
Page count: 15