Minimizing Overheads of Checkpoints in Distributed Stream Processing Systems

Cited by: 0

Authors:

Akber, Syed Muhammad Abrar [1]

Chen, Hanhua [1]

Wang, Yonghui [1]

Jin, Hai [1]

Affiliations:

[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Serv Comp Technol & Syst Lab, Cluster & Grid Comp Lab, Big Data Technol & Syst L, Wuhan 430074, Hubei, Peoples R China

Source:

2018 IEEE 7TH INTERNATIONAL CONFERENCE ON CLOUD NETWORKING (CLOUDNET) | 2018

Keywords:

resilience; fault tolerance; checkpoints; distributed stream processing; distributed systems;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Subject classification code:

0812;

Abstract:

Failures in large-scale systems are inevitable, which makes resilience a key challenge for modern systems. Checkpointing with rollback recovery is a well-known approach to providing fault tolerance in distributed systems. The checkpoint-based approach periodically persists the application state to reliable storage, and the persisted state serves as a recovery point in case of failure. Such periodic checkpoints are not in line with the failure rate of the systems, as many studies conclude that failure occurrence is not periodic. The length of the checkpoint interval is a crucial decision because it directly determines the checkpoint overheads. To minimize these overheads, we propose reducing the number of checkpoints taken during application execution by successively increasing the checkpoint intervals. We consider the failure probability of the underlying infrastructure and iteratively increase the checkpoint intervals. The proposed approach tailors checkpoint initiation to the failure probability: if the failure probability is low, it increases the checkpoint interval, which reduces the total number of checkpoints triggered over the application's lifetime and, in turn, the checkpoint overheads. The experimental results show that the proposed checkpoint policy considerably reduces checkpoint overheads compared with periodic checkpoints.
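The abstract does not state the exact rule used to grow the interval, so the following is only a minimal sketch of the general idea, not the authors' policy. It assumes a hypothetical `failure_prob` callable that estimates the failure probability at a given time, a hypothetical `threshold` below which the interval is multiplied by `growth`, an optional `max_interval` cap, and a reset to the base interval whenever the probability is not low. All of these names and values are illustrative assumptions.

```python
def periodic_checkpoints(total_time, interval):
    """Return checkpoint times for a fixed (periodic) interval."""
    times = []
    t = interval
    while t <= total_time:
        times.append(t)
        t += interval
    return times


def adaptive_checkpoints(total_time, base_interval, failure_prob,
                         threshold=0.1, growth=2, max_interval=None):
    """Return checkpoint times when the interval grows under low failure probability.

    failure_prob is a hypothetical callable: time -> probability in [0, 1].
    threshold, growth, and max_interval are illustrative assumptions.
    """
    times = []
    interval = base_interval
    t = interval
    while t <= total_time:
        times.append(t)
        if failure_prob(t) < threshold:
            # Low failure probability: lengthen the next interval.
            interval *= growth
            if max_interval is not None:
                interval = min(interval, max_interval)
        else:
            # Assumed behavior: fall back to the base interval.
            interval = base_interval
        t += interval
    return times


if __name__ == "__main__":
    always_low = lambda t: 0.01  # hypothetical constant estimate
    fixed = periodic_checkpoints(1000, 10)
    adaptive = adaptive_checkpoints(1000, 10, always_low, max_interval=160)
    print(len(fixed), len(adaptive))
```

Under these assumed parameters the adaptive schedule triggers far fewer checkpoints than the periodic one over the same run, which is the effect the abstract attributes to the proposed policy. The cap on the interval is a design choice of this sketch: without it, a long stretch of low failure probability would let the interval grow without bound and leave a very large amount of work to replay after a failure.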

References

Pages: 4

8 entries in total

[1] Bautista-Gomez Leonardo Arturo, 2016, P IPDPS CHIC IL US 2

[2] Carbone P., 2015, ARXIV150608603

[3] Chen H., 2017, P ICNP TOR ON CAN 10

[4] El-Sayed Nosayba, 2014, P CLUST MADR SPAIN 2

[5] The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems [J]. Javadi, Bahman; Kondo, Derrick; Iosup, Alexandru; Epema, Dick. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (08): 1208-1223

[6] Naksinehaboon Nichamon, 2008, P CCGRID LYON FRANC

[7] Two-State Checkpointing for Energy-Efficient Fault Tolerance in Hard Real-Time Systems [J]. Salehi, Mohammad; Tavana, Mohammad Khavari; Rehman, Semeen; Shafique, Muhammad; Ejlali, Alireza; Henkel, Joerg. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2016, 24 (07): 2426-2437

[8] A Large-Scale Study of Failures in High-Performance Computing Systems [J]. Schroeder, Bianca; Gibson, Garth A. IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2010, 7 (04): 337-350
