Fault-Aware Runtime Strategies for High-Performance Computing

Cited by: 16
Authors
Li, Yawei [1]
Lan, Zhiling [1]
Gujrati, Prashasta [1]
Sun, Xian-He [1]
Affiliations
[1] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Funding
National Science Foundation (USA)
Keywords
High-performance computing; runtime strategies; fault tolerance; performance; reliability; 0-1 knapsack; maximizing reliability; task allocation
DOI
10.1109/TPDS.2008.128
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with failure prediction and fault tolerance techniques, construct a runtime system called Fault-Aware Runtime System (FARS). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
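The abstract names a 0-1 knapsack model for deciding which running jobs to reallocate, but this record does not give the paper's formulation. As a rough illustration only, the sketch below solves a generic 0-1 knapsack with dynamic programming. The mapping is an assumption made here, not taken from the paper: each job has a hypothetical "value" (the benefit of moving it away from a node predicted to fail) and a "weight" (the number of spare nodes it needs), and the capacity is the number of available spare nodes. The function name `knapsack_01` and the sample numbers are likewise hypothetical.

```python
from typing import List, Tuple


def knapsack_01(values: List[float], weights: List[int],
                capacity: int) -> Tuple[float, List[int]]:
    """Generic 0-1 knapsack via dynamic programming.

    Assumed mapping (not from the paper): values[i] is the benefit of
    rescheduling job i, weights[i] is the number of spare nodes job i
    needs, and capacity is the number of spare nodes available.
    Returns the best total value and the indices of the chosen jobs.
    """
    n = len(values)
    # best[i][c] = best value using the first i jobs with c spare nodes
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip job i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take job i-1

    # Walk the table backwards to recover which jobs were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical example: three jobs and five spare nodes.
total, jobs = knapsack_01([10.0, 7.0, 6.0], [4, 3, 2], 5)
print(total, jobs)
```

In this made-up instance, moving the two smaller jobs together is worth more than moving the single largest one, which is the kind of trade-off a knapsack formulation captures. The table costs O(n × capacity) time and memory.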
Pages: 460-473
Number of pages: 14