HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems

Cited by: 13
Authors
Luo, Yi [1]
Manivannan, D. [1]
Affiliation
[1] Univ Kentucky, Dept Comp Sci, Lexington, KY 40506 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2012 / Vol. 28 / Issue 08
Funding
U.S. National Science Foundation;
Keywords
Large scale systems; Checkpointing protocols; Message logging protocols; Consistent global checkpoint; Fault tolerance; Failure recovery in distributed systems; RECOVERY;
DOI
10.1016/j.future.2012.03.012
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Future generation supercomputers will be message-passing distributed systems consisting of hundreds of thousands of processors. As the size of the system grows, the failure rate increases. Hence, for the success and deployability of such large scale systems, scalable checkpointing and recovery protocols need to be implemented. Existing checkpointing and rollback recovery protocols used for providing fault tolerance in distributed systems do not scale to such large systems. In this paper, we address this important and timely issue and propose a scalable group-based Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging (HOPE) protocol. Performance evaluation indicates that our protocol takes a balanced approach, lowering checkpointing and message logging overhead and enhancing scalability. © 2012 Elsevier B.V. All rights reserved.
Pages: 1217-1235
Page count: 19