Predictable quality of service atop degradable distributed systems

Cited by: 5
Authors
Ramakrishnan, Lavanya [1]
Reed, Daniel A. [2]
Affiliations
[1] Indiana Univ, Bloomington, IN 47405 USA
[2] Microsoft Res, Redmond, WA USA
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2013 / Vol. 16 / No. 2
Funding
U.S. National Science Foundation;
Keywords
Performability; Reliability; Workflow scheduling; Fault tolerance; Grid and cloud resource management;
DOI
10.1007/s10586-009-0078-y
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
High-performance and distributed computing systems such as petascale, grid, and cloud infrastructures are increasingly used to run scientific models and business services. These systems experience large variations in availability due to hardware and software failures. Resource providers need to account for these variations while providing the required Quality of Service (QoS) at appropriate cost in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the QoS lost at varying availability levels. In this paper, we present a resource performability model to estimate lost performance and the corresponding cost considerations at varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources to trade off performance, reliability, and cost. We use simulation results driven by failure data collected over the lifetime of high-performance systems to demonstrate how the proposed scheme better accounts for resource availability.
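The abstract does not give the model's equations, so the following is only a generic illustration of what a performability estimate can look like, not the authors' formulation. It assumes a resource can be described by a few discrete states, each with a long-run probability and a delivered fraction of full performance; the names `DegradedState`, `expected_performance`, and `lost_performance`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradedState:
    """One operating state of a resource (hypothetical illustration)."""
    probability: float  # assumed long-run probability of being in this state
    capacity: float     # assumed fraction of full performance delivered (0..1)


def expected_performance(states):
    """Probability-weighted average of delivered performance."""
    total = sum(s.probability for s in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(s.probability * s.capacity for s in states)


def lost_performance(states):
    """Fraction of full performance lost to degradation and failures."""
    return 1.0 - expected_performance(states)


# Made-up numbers: fully up 90% of the time, half capacity 8%, down 2%.
example_states = [
    DegradedState(probability=0.90, capacity=1.0),
    DegradedState(probability=0.08, capacity=0.5),
    DegradedState(probability=0.02, capacity=0.0),
]

if __name__ == "__main__":
    print(f"expected performance: {expected_performance(example_states):.2f}")
    print(f"lost performance:     {lost_performance(example_states):.2f}")
```

With these made-up inputs the expected performance is roughly 0.94 of full capacity. A scheduler could, under the same assumptions, compare such estimates across candidate resources when weighing performance against reliability and cost.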
Pages: 321-334
Number of pages: 14