Optimal Server Selection for Straggler Mitigation

Cited by: 17
Authors
Badita, Ajay [1]
Parag, Parimal [1]
Aggarwal, Vaneet [2, 3]
Affiliations
[1] Indian Inst Sci, Dept Elect & Commun Engn, Bengaluru 560012, India
[2] Purdue Univ, Sch Ind Engn, W Lafayette, IN 47907 USA
[3] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA
Funding
U.S. National Science Foundation
Keywords
Servers; Task analysis; Job shop scheduling; Redundancy; Processor scheduling; IEEE transactions; Straggler mitigation; distributed computing; shifted exponential distribution; completion time; scheduling; forking points; redundant requests
DOI
10.1109/TNET.2020.2973224
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional parallel servers are added at forking instants. In terms of the forking instants and the number of additional servers, we compute the job completion time and the cost of server utilization when the task processing times are assumed to have a shifted exponential distribution. We use this study to provide insights into the scheduling design of the forking instants and the associated number of additional servers to be started. Numerical results demonstrate orders of magnitude improvement in cost in the regime of low completion times as compared to the prior works.
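The forking idea in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's analysis: it assumes a single fork, assumes the job completes as soon as any one server finishes, and assumes every server is billed from its start until that moment. The function name `simulate_single_fork` and all parameter names are hypothetical.

```python
import random


def simulate_single_fork(n0, n1, t_fork, shift, rate, trials=20000, seed=0):
    """Estimate mean completion time and mean server cost for one fork.

    Assumptions (illustrative, not taken from the paper):
      * n0 >= 1 servers start at time 0; n1 more start at time t_fork
        if the job is still running then.
      * Each server's execution time is shift + Exp(rate), i.i.d.
      * The job completes when the first server finishes, and all
        servers are cancelled at that instant.
      * Cost is the total server time consumed up to completion.
    """
    rng = random.Random(seed)
    total_time = 0.0
    total_cost = 0.0
    for _ in range(trials):
        # Finish times of the initial servers, all started at time 0.
        initial = [shift + rng.expovariate(rate) for _ in range(n0)]
        first_initial = min(initial)
        if first_initial <= t_fork:
            # Job finished before the forking instant: no extra servers.
            done = first_initial
            cost = n0 * done
        else:
            # Extra servers start at t_fork and need their own shift.
            forked = [t_fork + shift + rng.expovariate(rate) for _ in range(n1)]
            done = min(initial + forked)
            cost = n0 * done + n1 * (done - t_fork)
        total_time += done
        total_cost += cost
    return total_time / trials, total_cost / trials


if __name__ == "__main__":
    for extra in (0, 2, 4):
        mean_time, mean_cost = simulate_single_fork(
            n0=1, n1=extra, t_fork=1.0, shift=1.0, rate=1.0
        )
        print(f"n1={extra}: mean time {mean_time:.3f}, mean cost {mean_cost:.3f}")
```

Varying `t_fork` and `n1` in this sketch exposes the trade-off the abstract refers to: more servers added earlier tend to lower the completion time while raising the server cost.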
Pages: 709-721
Number of pages: 13
Related papers
31 in total
[11]  
Bitar R, 2017, IEEE INT SYMP INFO, P2900, DOI 10.1109/ISIT.2017.8007060
[12]   An interior point algorithm for large-scale nonlinear programming [J].
Byrd, RH ;
Hribar, ME ;
Nocedal, J .
SIAM JOURNAL ON OPTIMIZATION, 1999, 9 (04) :877-900
[13]   Improving MapReduce Performance in Heterogeneous Environments with Adaptive Task Tuning [J].
Cheng, Dazhao ;
Rao, Jia ;
Guo, Yanfei ;
Zhou, Xiaobo .
ACM/IFIP/USENIX MIDDLEWARE 2014, 2014, :97-108
[14]   On the Lambert W function [J].
Corless, RM ;
Gonnet, GH ;
Hare, DEG ;
Jeffrey, DJ ;
Knuth, DE .
ADVANCES IN COMPUTATIONAL MATHEMATICS, 1996, 5 (04) :329-359
[15]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '04), P137
[16]   The Tail at Scale [J].
Dean, Jeffrey ;
Barroso, Luiz Andre .
COMMUNICATIONS OF THE ACM, 2013, 56 (02) :74-80
[17]  
Gardner Kristen, 2015, ACM SIGMETRICS Performance Evaluation Review, V43, P347
[18]   Queueing with redundant requests: exact analysis [J].
Gardner, Kristen ;
Zbarsky, Samuel ;
Doroudi, Sherwin ;
Harchol-Balter, Mor ;
Hyytia, Esa ;
Scheller-Wolf, Alan .
QUEUEING SYSTEMS, 2016, 83 (3-4) :227-259
[19]   Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters [J].
Garraghan, Peter ;
Ouyang, Xue ;
Yang, Renyu ;
McKee, David ;
Xu, Jie .
IEEE TRANSACTIONS ON SERVICES COMPUTING, 2019, 12 (01) :91-104
[20]   Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution [J].
Guo, Yanfei ;
Rao, Jia ;
Jiang, Changjun ;
Zhou, Xiaobo .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (03) :798-812