Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster

Cited by: 0
Authors
Sultana, Abeda [1]
Xu, Fei [2]
Yuan, Xu [3]
Chen, Li [1]
Tzeng, Nian-Feng [1]
Affiliations
[1] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA USA
[2] East China Normal Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[3] Univ Delaware, Dept Comp & Informat Sci, Newark, DE USA
Funding
U.S. National Science Foundation
Keywords
distributed deep learning; scheduling; optimization
DOI
10.1109/IPDPS57955.2024.00066
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification code
081202; 0835
Abstract
With the wide adoption of deep neural network (DNN) models for various applications, enterprises and cloud providers have built deep learning clusters and increasingly deployed specialized accelerators, such as GPUs and TPUs, for DNN training jobs. Existing schedulers fall short in arbitrating cluster resources among multi-user jobs: they either lack fine-grained heterogeneity awareness or are hard to generalize to various scheduling policies. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. The primal-dual framework is employed, with our design of a dual subroutine, to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models have been conducted to demonstrate that Hadar improves the average job completion time (JCT) by 3x over an Apache YARN-based resource manager used in production. Moreover, Hadar outperforms Gavel [1], the state-of-the-art heterogeneity-aware scheduler, by 2.5x for the average JCT, shortens the queuing delay by 13%, and improves finish-time fairness (FTF) by 1.5%.
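As a rough illustration of why per-job heterogeneity awareness can matter for average JCT, the toy sketch below compares a placement that ignores per-job speedups with one chosen by exhaustive search. It is not Hadar's primal-dual algorithm and does not reproduce any result from the paper; the job names, GPU types, throughputs, and workloads are all hypothetical values chosen for the example.

```python
from itertools import permutations

# Hypothetical throughputs (iterations/second) of each job on each GPU type.
THROUGHPUT = {
    "job_a": {"V100": 10.0, "K80": 2.0},
    "job_b": {"V100": 4.0, "K80": 3.0},
}
# Hypothetical remaining work (iterations) for each job.
WORK = {"job_a": 1000.0, "job_b": 600.0}


def completion_times(assignment):
    """Completion time of each job when run on its assigned GPU type."""
    return {job: WORK[job] / THROUGHPUT[job][gpu] for job, gpu in assignment.items()}


def average_jct(assignment):
    """Average job completion time (JCT) over all jobs in the assignment."""
    times = completion_times(assignment)
    return sum(times.values()) / len(times)


def best_assignment(jobs, gpus):
    """Brute-force the one-GPU-per-job assignment with the lowest average JCT."""
    candidates = (dict(zip(jobs, perm)) for perm in permutations(gpus, len(jobs)))
    return min(candidates, key=average_jct)


# A placement that ignores how much each job benefits from the faster GPU.
agnostic = {"job_a": "K80", "job_b": "V100"}
# A placement chosen with knowledge of the per-job throughputs.
aware = best_assignment(list(WORK), ["V100", "K80"])

# An "Nx improvement in average JCT" is the ratio of the two averages.
speedup = average_jct(agnostic) / average_jct(aware)
print(aware, average_jct(agnostic), average_jct(aware), round(speedup, 2))
```

Brute force is only workable for a handful of jobs; it stands in here for whatever optimization a real scheduler would use.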
Pages: 681-691
Page count: 11
Related papers
50 in total
  • [1] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
    Narayanan, Deepak
    Santhanam, Keshav
    Kazhamiaka, Fiodar
    Phanishayee, Amar
    Zaharia, Matei
    PROCEEDINGS OF THE 14TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '20), 2020, : 481 - 498
  • [2] SCHEDTUNE: A Heterogeneity-Aware GPU Scheduler for Deep Learning
    Albahar, Hadeel
    Dongare, Shruti
    Du, Yanlin
    Zhao, Nannan
    Paul, Arnab K.
    Butt, Ali R.
    2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 695 - 705
  • [3] Heterogeneity-aware Deep Learning Workload Deployments on the Computing Continuum
    Bouvier, Thomas
    2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 1027 - 1027
  • [4] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
    Subramanya, Suhas Jayaram
    Arfeen, Daiyaan
    Lin, Shouxu
    Qiao, Aurick
    Jia, Zhihao
    Ganger, Gregory R.
    PROCEEDINGS OF THE TWENTY-NINTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES, SOSP 2023, 2023, : 642 - 657
  • [5] Heterogeneity-Aware Scheduling on SoCs for Autonomous Vehicles
    Amarnath, Aporva
    Pal, Subhankar
    Kassa, Hiwot Tadese
    Vega, Augusto
    Buyuktosunoglu, Alper
    Franke, Hubertus
    Wellman, John-David
    Dreslinski, Ronald
    Bose, Pradip
    IEEE COMPUTER ARCHITECTURE LETTERS, 2021, 20 (02) : 82 - 85
  • [6] Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization
    Zhou, Qihua
    Guo, Song
    Qu, Zhihao
    Li, Peng
    Li, Li
    Guo, Minyi
    Wang, Kun
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (05) : 1030 - 1043
  • [7] Heterogeneity-aware fair federated learning
    Li, Xiaoli
    Zhao, Siran
    Chen, Chuan
    Zheng, Zibin
    INFORMATION SCIENCES, 2023, 619 : 968 - 986
  • [8] FedGPO: Heterogeneity-Aware Global Parameter Optimization for Efficient Federated Learning
    Kim, Young Geun
    Wu, Carole-Jean
    2022 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2022), 2022, : 117 - 129
  • [9] Predictive Heterogeneity-Aware Application Scheduling for Chip Multiprocessors
    Chen, Jian
    Nair, Arun Arvind
    John, Lizy K.
    IEEE TRANSACTIONS ON COMPUTERS, 2014, 63 (02) : 435 - 447
  • [10] SplitAVG: A Heterogeneity-Aware Federated Deep Learning Method for Medical Imaging
    Zhang, Miao
    Qu, Liangqiong
    Singh, Praveer
    Kalpathy-Cramer, Jayashree
    Rubin, Daniel L.
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (09) : 4635 - 4644