Recovery for Virtualized Environments

Times cited: 15
Authors
Cerveira, Frederico [1]
Barbosa, Raul [1]
Madeira, Henrique [1]
Araujo, Filipe [1]
Affiliations
[1] Univ Coimbra, Dept Informat Engn, CISUC, P-3030290 Coimbra, Portugal
Source
2015 Eleventh European Dependable Computing Conference (EDCC) | 2015
Keywords
Virtualization; fault injection; cloud computing; fault tolerance; dependability
DOI
10.1109/EDCC.2015.26
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cloud infrastructures provide elastic computing resources to client organizations, enabling them to build online applications while avoiding the fixed costs associated with a complete IT infrastructure. However, such organizations are unlikely to fully trust the cloud for their most critical applications. Among other threats, soft errors are expected to increase as transistor geometries shrink, and many errors are left for the software layers to correct and mask. This paper characterizes the behavior of a virtualized environment, using Xen with CentOS as the hypervisor, in the presence of soft errors. One of the main threats arises from soft errors directly affecting the hypervisor, as these faults have the potential to disrupt several virtual machines at once. With this in mind, we develop a fault-tolerant architecture for cloud applications, whose design is guided by experimental data collected using fault injection. This architecture recovers from bit-flip errors by using a watchdog timer to securely reboot the hypervisor. Nevertheless, errors might still propagate outside the system, for example to a client in a client-server interaction. Despite this, our results suggest that our architecture and a few simple techniques, like timers on the client, can recover from a very large fraction of errors in client-server applications with small hardware and performance overhead. Conversely, the fraction of errors requiring Byzantine fault-tolerant techniques is quite small, thus restricting those expensive approaches to highly critical applications.
Pages: 25-36
Number of pages: 12