On a Meta Learning-Based Scheduler for Deep Learning Clusters

Cited by: 1
Authors
Yang, Jin [1]
Bao, Liang [1]
Liu, Wenjing [1]
Yang, Rong [1]
Wu, Chase Q. [2]
Affiliations
[1] Xidian Univ, Xian 710071, Peoples R China
[2] New Jersey Inst Technol, Newark, NJ 07102 USA
Funding
National Natural Science Foundation of China;
Keywords
Deep learning cluster; worker placement; meta learning; reinforcement learning;
DOI
10.1109/TCC.2023.3308161
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Deep learning (DL) has become a dominant type of workload on AI computing platforms. The performance of such platforms depends heavily on how distributed DL jobs are scheduled. Reinforcement learning (RL)-based schedulers have been extensively studied and are capable of modeling interference between concurrent jobs competing for resources. However, existing RL-based schedulers must learn from a large number of samples and adapt to workload changes in real systems, which is costly for production clusters. This paper proposes an intelligent, autonomous scheduler that employs sample-efficient RL for real-world resource scheduling on complex DL clusters. Specifically, we design a closed-loop meta-RL-based worker placement algorithm for DL training jobs. Instead of exploring randomly, we encourage the scheduler to explore combinatorial subspaces where the performance model might be inaccurate, which improves the sampling efficiency of the scheduler agent. Extensive experimental results demonstrate that our algorithm outperforms the baselines in average job completion time by 12.29% to 16.24%. Further experiments with workload variations yield improvements of 15.76% to 22.13%.
Pages: 3631-3642
Page count: 12