Fault tolerant adaptive parallel and distributed simulation through functional replication

Cited by: 3
Authors
D'Angelo, Gabriele [1]
Ferretti, Stefano [1]
Marzolla, Moreno [1]
Affiliations
[1] Univ Bologna, Dept Comp Sci & Engn, Mura Anteo Zamboni 7, I-40127 Bologna, Italy
Keywords
Simulation; Parallel and distributed simulation; Fault tolerance; Adaptive systems; Middleware; Agent-based simulation
DOI
10.1016/j.simpat.2018.09.012
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them across multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.
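The message-filtering idea in the abstract can be illustrated with a minimal sketch. This is not FT-GAIA code: the function names `replicas_needed` and `accept_message` are hypothetical, and the sketch assumes the textbook replication bounds (f + 1 copies to survive f crash failures, 2f + 1 copies to outvote f Byzantine copies) together with simple majority voting on the receiver side. The abstract itself does not specify the voting rule.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def replicas_needed(f: int, byzantine: bool) -> int:
    """Copies of each entity needed to tolerate f faulty copies.

    Assumption: standard bounds, f + 1 for crash failures and
    2f + 1 for Byzantine failures (not taken from the abstract).
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if byzantine else f + 1


def accept_message(copies: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return the payload reported by at least f + 1 copies, else None.

    `copies` holds the payloads received from the replicas of one
    sending entity for the same logical message (hypothetical model).
    """
    if not copies:
        return None
    payload, votes = Counter(copies).most_common(1)[0]
    # With at most f corrupted copies, f + 1 matching copies must
    # include at least one correct sender.
    return payload if votes >= f + 1 else None


if __name__ == "__main__":
    received = ["move(3,4)", "move(3,4)", "move(9,9)"]  # one corrupted copy
    print(replicas_needed(1, byzantine=True))
    print(accept_message(received, f=1))
```

Under these assumptions, a receiver that gets three copies of a message with one corrupted copy keeps the payload on which two copies agree and discards the outlier. If no payload reaches f + 1 matching copies, the sketch returns `None` rather than guessing.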
Pages: 192-207
Number of pages: 16