Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments

Cited by: 41
Authors
Amaral, Marcelo [1]
Polo, Jorda [2]
Carrera, David [1]
Seelam, Seetharami [3]
Steinder, Malgorzata [3]
Affiliations
[1] Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain
[2] Barcelona Supercomp Ctr, Barcelona, Spain
[3] IBM Watson Res Ctr, Yorktown Hts, NY USA
Source
SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2017
Funding
European Research Council
Keywords
Scheduling; Placement; GPU; Multi-GPU; Performance Analysis; Resource Contention; Workload Interference and Deep Learning;
DOI
10.1145/3126908.3126933
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and the Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to approximately 1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
Pages: 12