An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1 ]
Gonzalez, Oscar J. [2 ]
Zhou, Xiaobo [1 ]
Williams, Thomas [2 ]
Friedman, Brian D. [2 ]
Havemann, Martin [2 ]
Woo, Thomas [2 ]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Source
PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20) | 2020
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes
DOI
10.1109/SC41405.2020.00094
CLC classification
TP [Automation Technology, Computer Technology]
Subject classification
0812
Abstract
Efficient GPU scheduling is key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs, or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
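
As a rough illustration of the progress-based scheduling idea described in the abstract (not the authors' implementation, which runs as Kubernetes plugins), the following minimal Python sketch greedily hands out spare GPUs to whichever running job's measured progress predicts the largest marginal speedup. The Job class, the Amdahl-style scaling model, and all constants here are hypothetical assumptions made for the sketch.

# Minimal sketch (NOT the authors' code): a greedy, progress-aware GPU
# allocator in the spirit of the adaptive scheduler described in the abstract.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int          # GPUs currently allocated to this job
    base_rate: float   # measured epochs/hour on a single GPU
    overhead: float    # per-extra-GPU synchronization cost (assumed to be
                       # fitted from the job's reported training progress)

    def rate(self, g: int) -> float:
        # Amdahl-style scaling model (a placeholder): throughput grows
        # sub-linearly in the GPU count because of synchronization overhead.
        return self.base_rate * g / (1 + self.overhead * (g - 1))

    def marginal_speedup(self) -> float:
        # Predicted throughput gain from granting this job one more GPU.
        return self.rate(self.gpus + 1) - self.rate(self.gpus)

def schedule(jobs: list, total_gpus: int) -> dict:
    # Greedily hand spare GPUs, one at a time, to the job whose progress
    # data predicts the largest marginal speedup.
    alloc = {j.name: j.gpus for j in jobs}
    spare = total_gpus - sum(alloc.values())
    for _ in range(spare):
        best = max(jobs, key=lambda j: j.marginal_speedup())
        alloc[best.name] += 1
        best.gpus += 1
    return alloc

if __name__ == "__main__":
    # Two running jobs on a 16-GPU cluster: one scales poorly, one scales well.
    jobs = [Job("resnet", 1, 10.0, 0.10), Job("bert", 1, 9.0, 0.02)]
    print(schedule(jobs, total_gpus=16))   # -> {'resnet': 3, 'bert': 13}

In the paper's framework, once the scheduler decides on a new allocation, the change is applied by the "SideCar" process, which briefly stops the job's training process and restarts it with the new GPU count, avoiding both full job termination and any modification to the DL framework itself.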
Pages: 13