A Survey on Resiliency Techniques in Cloud Computing Infrastructures and Applications

Cited by: 106
Authors
Colman-Meixner, Carlos [1]
Develder, Chris [2]
Tornatore, Massimo [3]
Mukherjee, Biswanath [1]
Affiliations
[1] Univ Calif Davis, Dept Elect & Comp Engn, Davis, CA 95616 USA
[2] Ghent Univ iMinds, Intec IBCN, BE-9052 Ghent, Belgium
[3] Politecn Milan, I-20133 Milan, Italy
Keywords
Cloud computing; resilience; virtualization; middleware; optical networks; disaster resilience; DATA CENTER NETWORK; VIRTUAL MACHINE PLACEMENT; BYZANTINE FAULT-TOLERANCE; HIGH-AVAILABILITY; SERVER INTERCONNECTION; RELIABILITY; MANAGEMENT; FAILURE; DESIGN; ENVIRONMENTS
DOI
10.1109/COMST.2016.2531104
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Today's businesses increasingly rely on cloud computing, which brings both great opportunities and challenges. One of the critical challenges is resiliency: disruptions due to failures (either accidental or because of disasters or attacks) may entail significant revenue losses (e.g., US$25.5 billion in 2010 for North America). Such failures may originate at any of the major components in a cloud architecture (and propagate to others): 1) the servers hosting the application; 2) the network interconnecting them (on different scales, from inside a data center up to wide-area connections); or 3) the application itself. We comprehensively survey a large body of work focusing on resilience of cloud computing, in each (or a combination) of the server, network, and application components. First, we present the cloud computing architecture and its key concepts. We highlight both the infrastructure (servers, network) and application components. A key concept is virtualization of infrastructure (i.e., partitioning into logically separate units), and thus we detail the components in both physical and virtual layers. Before moving to the detailed resilience aspects, we provide a qualitative overview of the types of failures that may occur (from the perspective of the layered cloud architecture), and their consequences. The second major part of the paper introduces and categorizes a large number of techniques for cloud computing infrastructure resiliency. These range from designing and operating the facilities, servers, and networks to their integration and virtualization (including resilience of the middleware infrastructure). The third part focuses on resilience in application design and development. We study how applications are designed, installed, and replicated to survive multiple physical failure scenarios as well as disaster failures.
Pages: 2244-2281
Page count: 38