Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning

Cited by: 109
Authors
Chaudhary, Shubham [1]
Ramjee, Ramachandran [1]
Sivathanu, Muthian [1]
Kwatra, Nipun [1]
Viswanatha, Srinidhi [1]
Affiliations
[1] Microsoft Res India, Bengaluru, Karnataka, India
Source
PROCEEDINGS OF THE FIFTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS (EUROSYS'20) | 2020
DOI
10.1145/3342195.3387555
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We present Gandiva_fair, a distributed, fair-share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandiva_fair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandiva_fair is the first scheduler that allocates cluster-wide GPU time fairly among active users. Gandiva_fair achieves efficiency and fairness despite cluster heterogeneity. Data centers host a mix of GPU generations because of the rapid pace at which newer and faster GPUs are released. As the newer generations face higher demand from users, older GPU generations suffer poor utilization, which reduces cluster efficiency. Gandiva_fair profiles the variable marginal utility that different jobs gain from newer GPUs, and transparently incentivizes users to use older GPUs through a novel resource trading mechanism that maximizes cluster efficiency without affecting the fairness guarantees of any user. With a prototype implementation and evaluation in a heterogeneous 200-GPU cluster, we show that Gandiva_fair achieves both fairness and efficiency under realistic multi-user workloads.
Pages: 16