An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1 ]
Gonzalez, Oscar J. [2 ]
Zhou, Xiaobo [1 ]
Williams, Thomas [2 ]
Friedman, Brian D. [2 ]
Havemann, Martin [2 ]
Woo, Thomas [2 ]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Source
PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20), 2020
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes;
DOI
10.1109/SC41405.2020.00094
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Efficient GPU scheduling is the key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
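To make the "SideCar" idea concrete, the sketch below shows one minimal way a controller process could stop a training process and relaunch it with a different GPU set. This is an illustration only, not the paper's implementation: it assumes a hypothetical train.py that checkpoints on SIGTERM and resumes from its latest checkpoint on restart, and it uses CUDA_VISIBLE_DEVICES to change the GPU count between runs.

    # Minimal sketch (not the authors' implementation) of a SideCar-style
    # controller that restarts a training process with a different GPU set.
    # Assumes a hypothetical train.py that checkpoints on SIGTERM and
    # resumes from the latest checkpoint when relaunched.
    import os
    import signal
    import subprocess

    TRAIN_CMD = ["python", "train.py", "--resume-from-latest-checkpoint"]  # hypothetical script

    def launch(gpu_ids):
        """Start the training process restricted to the given GPUs."""
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, gpu_ids)))
        return subprocess.Popen(TRAIN_CMD, env=env)

    def reallocate(proc, new_gpu_ids, grace_seconds=60):
        """Stop the current training process, then restart it on new GPUs."""
        proc.send_signal(signal.SIGTERM)      # ask the job to checkpoint and exit
        try:
            proc.wait(timeout=grace_seconds)  # give it time to flush the checkpoint
        except subprocess.TimeoutExpired:
            proc.kill()                       # fall back to a hard stop
            proc.wait()
        return launch(new_gpu_ids)            # relaunch with the new GPU set

    if __name__ == "__main__":
        job = launch([0, 1])                  # start the job on 2 GPUs
        # ... later, the scheduler decides this job should grow to 4 GPUs ...
        job = reallocate(job, [0, 1, 2, 3])
        job.wait()

Because only the process is paused and relaunched from a checkpoint, nothing in the training framework itself needs to change, which is the sense in which such a mechanism is non-intrusive.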
Pages: 13