Improving Failure Tolerance in Large-Scale Cloud Computing Systems

Cited by: 12
Authors
Luo, Liang [1]
Meng, Sa [1]
Qiu, Xiwei [1]
Dai, Yuanshun [1]
Affiliations
[1] Univ Elect Sci & Technol, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cloud computing; failure tolerance; large-scale system; Markov model; MANAGEMENT; STRATEGY; IAAS;
DOI
10.1109/TR.2019.2901194
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. With the scale and complexity of these systems increasing dramatically, various hardware and software failures will inevitably occur and may not be detected and repaired in a timely manner. Besides, sophisticated architectural features of cloud computing may also have an adverse impact on system reliability. In response to these challenges, this paper proposes a simulation-driven framework based on real cloud computing system operation logs for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operation characteristics. A Markov-based model is used to examine the system's potential failures, assess their severities, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resources so that the system's reliability can be improved cost-effectively. We also report a case study to demonstrate the application of our algorithm in improving failure tolerance of a large-scale cloud computing system.
Pages: 620-632
Page count: 13