Optimum Interval for Application-level Checkpoints

Cited by: 6
Authors
Siavvas, Miltiadis [1 ,2 ]
Gelenbe, Erol [3 ]
Affiliations
[1] Imperial Coll London, London, England
[2] Ctr Res & Technol Hellas, Thessaloniki, Greece
[3] Polish Acad Sci, Inst Theoret & Appl Informat, Gliwice, Poland
Source
2019 6TH IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND CLOUD COMPUTING (IEEE CSCLOUD 2019) / 2019 5TH IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD (IEEE EDGECOM 2019) | 2019
Funding
European Union Horizon 2020
Keywords
Cloud Computing; Software Reliability; Roll Back Recovery; Application Level Checkpoints; Optimum Checkpoints; Program Loops; AVAILABILITY; SYSTEMS;
DOI
10.1109/CSCloud/EdgeCom.2019.000-4
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Checkpointing is commonly adopted for enhancing the performance of software applications that operate in the presence of failures. Among the existing checkpointing strategies, Application-level Checkpoint and Restart (ALCR) is considered the most efficient, since it leaves a smaller memory footprint, but it requires significant development effort. Although existing ALCR tools and libraries reduce the effort required for implementing the checkpoints, they do not provide recommendations regarding the inter-checkpoint interval. To this end, in the present paper, we develop a mathematical model to estimate the optimum checkpoint interval, i.e., the interval between two successive checkpoints that minimises the average execution time of the application. The case of programs with loops and nested loops is also discussed. The results are illustrated with several numerical examples.
Pages: 145-150
Page count: 6