FaCS: Toward a Fault-Tolerant Cloud Scheduler Leveraging Long Short-Term Memory Network

Cited by: 11

Authors:

Islam, Tariqul [1]

Manivannan, D. [1]

Affiliations:

[1] Univ Kentucky, Dept Comp Sci, Lexington, KY 40506 USA

Source:

2019 6TH IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND CLOUD COMPUTING (IEEE CSCLOUD 2019) / 2019 5TH IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD (IEEE EDGECOM 2019) | 2019

Keywords:

Fault-tolerance; Failure Prediction; Job and Task Scheduler; Long Short-Term Memory Network

DOI:

10.1109/CSCloud/EdgeCom.2019.00010

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Large-scale cloud datacenters often experience reduced performance and service outages. Because of the inherent complexity, heterogeneity, and multitenant architecture of these datacenters, the applications (i.e., jobs and tasks) running on them are susceptible to various types of failures. In this paper, we first characterize the application failures in the Google cluster trace and then propose a prediction model that forecasts the termination status of a task. We then introduce a task scheduler that dynamically reschedules tasks based on the predicted results. This proactive fault-tolerant scheduler improves system reliability and ensures timely execution of the applications. Simulation results show that our scheduler substantially reduces the makespan and failure rate of tasks while also balancing load. Moreover, early prediction combined with quick scheduling adjustment improves overall resource utilization and reduces resource wastage.
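The idea of rescheduling tasks based on a predicted termination status can be illustrated with a small sketch. This is not the paper's algorithm: the function name `reschedule`, the `predict_fail` callable (a stand-in for the LSTM-based predictor), and the least-loaded-node policy are all assumptions made here for illustration only.

```python
from typing import Callable, Dict, List


def reschedule(
    placement: Dict[str, str],
    load: Dict[str, int],
    predict_fail: Callable[[str, str], bool],
) -> List[str]:
    """Move each task predicted to fail onto the least-loaded other node.

    placement maps task id -> node id, load maps node id -> task count.
    predict_fail is a hypothetical predictor: True means "expected to fail".
    Both dictionaries are updated in place; the moved task ids are returned.
    """
    moved = []
    # Iterate over a sorted snapshot so the result is deterministic.
    for task, node in sorted(placement.items()):
        if not predict_fail(task, node):
            continue
        candidates = [n for n in load if n != node]
        if not candidates:
            continue  # nowhere else to run the task
        # Assumed policy: lowest load wins, ties broken by node name.
        target = min(candidates, key=lambda n: (load[n], n))
        load[node] -= 1
        load[target] += 1
        placement[task] = target
        moved.append(task)
    return moved


placement = {"t1": "a", "t2": "a", "t3": "b"}
load = {"a": 2, "b": 1, "c": 0}
# Toy predictor: every task on node "a" is predicted to fail.
moved = reschedule(placement, load, lambda task, node: node == "a")
```

In this toy run, both tasks on node `a` are moved away, and the least-loaded choice spreads them across `b` and `c` rather than piling them onto one node.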


Pages: 1 / 6

Page count: 6
