Building and using a fault-tolerant MPI implementation

Cited by: 12
Authors
Fagg, GE
Dongarra, JJ
Affiliations
[1] High Performance Comp Ctr Stuttgart, D-70550 Stuttgart, Germany
[2] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Source
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2004, Vol. 18, No. 3
Keywords
fault tolerant; message passing; parallel computing; MPI
DOI
10.1177/1094342004046052
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the original MPI static process model. FT-MPI lets an application explicitly control the semantics and associated modes of failure through modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage, and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures internally at multiple levels and how existing applications can use the new features while remaining within the MPI standard.
Pages: 353-361
Page count: 9