An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1 ]
Gonzalez, Oscar J. [2 ]
Zhou, Xiaobo [1 ]
Williams, Thomas [2 ]
Friedman, Brian D. [2 ]
Havemann, Martin [2 ]
Woo, Thomas [2 ]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Source
PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20) | 2020
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes
DOI
10.1109/SC41405.2020.00094
CLC Classification Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Efficient GPU scheduling is key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs, or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
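The SideCar-based elastic reallocation described in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal illustration assuming a hypothetical training entrypoint (train.py) that checkpoints and exits cleanly on SIGTERM, and a hypothetical file (/etc/sidecar/gpus) through which the scheduler communicates the desired GPU count to the SideCar process.

```python
# Minimal sketch (not the paper's implementation) of a SideCar-style controller
# that stops a training process and restarts it with a different GPU count.
# TRAIN_CMD and DESIRED_GPUS_FILE are hypothetical; the real framework is
# implemented as Kubernetes plugins and driven by the adaptive scheduler.
import os
import signal
import subprocess
import time

TRAIN_CMD = ["python", "train.py"]        # hypothetical training entrypoint
DESIRED_GPUS_FILE = "/etc/sidecar/gpus"   # hypothetical channel from the scheduler


def read_desired_gpus(default=1):
    """Read the GPU count the cluster scheduler currently wants this job to use."""
    try:
        with open(DESIRED_GPUS_FILE) as f:
            return max(1, int(f.read().strip()))
    except (OSError, ValueError):
        return default


def launch_training(num_gpus):
    """Start the training process restricted to the first num_gpus devices."""
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=",".join(str(i) for i in range(num_gpus)))
    return subprocess.Popen(TRAIN_CMD, env=env)


def main():
    current = read_desired_gpus()
    proc = launch_training(current)
    while True:
        time.sleep(10)
        if proc.poll() is not None:
            break  # training finished on its own
        desired = read_desired_gpus(default=current)
        if desired != current:
            # Ask the trainer to checkpoint and exit (assumed to handle SIGTERM),
            # then relaunch it with the new device count: the job is never
            # terminated from the cluster scheduler's point of view, and the
            # DL framework itself is left unmodified.
            proc.send_signal(signal.SIGTERM)
            proc.wait()
            current = desired
            proc = launch_training(current)


if __name__ == "__main__":
    main()
```

In this sketch the stop/restart path is the only coordination point between the SideCar and the trainer, which is what makes the approach non-intrusive: any framework that can resume from its own checkpoints can be wrapped this way.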
Pages: 13
Related Papers (50 in total)
  • [1] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Cao, Jiamin; Guan, Yu; Qian, Kun; Gao, Jiaqi; Xiao, Wencong; Dong, Jianbo; Fu, Binzhang; Cai, Dennis; Zhai, Ennan. PROCEEDINGS OF THE 2024 ACM SIGCOMM 2024 CONFERENCE, ACM SIGCOMM 2024, 2024: 1-15.
  • [2] Non-Intrusive A/C Load Disaggregation Using Deep Learning. Cho, Jin; Hu, Zhen; Sartipi, Mina. 2018 IEEE/PES TRANSMISSION AND DISTRIBUTION CONFERENCE AND EXPOSITION (T&D), 2018.
  • [3] A Non-Intrusive Deep Learning Based Diagnosis System for Elevators. Chai, Songjian; Li, Xuran Ivan; Jia, Youwei; He, Yufei; Yip, Chi Ho; Cheung, Ka Kei; Wang, Minghao. IEEE ACCESS, 2021, 9: 20993-21003.
  • [4] Efficient Multi-Training Framework of Image Deep Learning on GPU Cluster. Chen, Chun-Fu; Lee, Gwo Giun; Xia, Yinglong; Lin, W. Sabrina; Suzumura, Toyotaro; Lin, Ching-Yung. 2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015: 489-494.
  • [5] A Weakly Supervised Active Learning Framework for Non-Intrusive Load Monitoring. Tanoni, Giulia; Sobot, Tamara; Principi, Emanuele; Stankovic, Vladimir; Stankovic, Lina; Squartini, Stefano. INTEGRATED COMPUTER-AIDED ENGINEERING, 2025, 32(01): 37-54.
  • [6] Non-Intrusive Water Surface Velocity Measurement Based on Deep Learning. An, Guocheng; Du, Tiantian; He, Jin; Zhang, Yanwei. WATER, 2024, 16(19).
  • [7] Deep Learning-Based Non-Intrusive Commercial Load Monitoring. Zhou, Mengran; Shao, Shuai; Wang, Xu; Zhu, Ziwei; Hu, Feng. SENSORS, 2022, 22(14).
  • [8] Non-Intrusive Model Reduction of Large-Scale, Nonlinear Dynamical Systems Using Deep Learning. Gao, Han; Wang, Jian-Xun; Zahr, Matthew J. PHYSICA D-NONLINEAR PHENOMENA, 2020, 412.
  • [9] Self-Adaptive Non-Intrusive Load Monitoring Using Deep Learning. Arampola, S. M. L.; Nisakya, M. S. K.; Yasodya, W. A.; Kumarawadu, S.; Logeeshan, V.; Wanigasekara, C. 2024 IEEE 5TH ANNUAL WORLD AI IOT CONGRESS, AIIOT 2024, 2024: 0540-0545.
  • [10] Tracking Defective Panel on Photovoltaic Strings with Non-Intrusive Monitoring and Deep Learning. Rocha, Helder R. O.; Silva, Andre; Coura, Daniel J. C.; Silvestre, Leonardo J.; Junior, Luis O. Rigo; Silva, Jair A. L.; Celeste, Wanderley C. JOURNAL OF CONTROL AUTOMATION AND ELECTRICAL SYSTEMS, 2024, 35(04): 688-701.