Building and using a fault-tolerant MPI implementation

Cited by: 12
Authors
Fagg, GE
Dongarra, JJ
Affiliations
[1] High Performance Comp Ctr Stuttgart, D-70550 Stuttgart, Germany
[2] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Keywords
fault tolerant; message passing; parallel computing; MPI
DOI
10.1177/1094342004046052
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the original MPI static process model. FT-MPI lets an application explicitly control the semantics and associated failure modes through modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures at multiple levels internally and how existing applications can use the new features while still remaining within the MPI standard.
Pages: 353 - 361
Number of pages: 9
Related papers
50 in total
  • [31] Review on Fault-Tolerant NoC Designs
    Jun-Shi Wang
    Le-Tian Huang
    Journal of Electronic Science and Technology, 2018, 16 (03) : 191 - 221
  • [32] Inherent Fault-Tolerant Multilevel Inverter
    Phukan, Hillol
    Tiwari, Dinesh Kumar
    Singh, Jiwanjot
    Pati, Avadh
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024,
  • [33] Fault-tolerant meshes with small degree
    Zhang, L
    IEEE TRANSACTIONS ON COMPUTERS, 2002, 51 (05) : 553 - 560
  • [34] Fault-tolerant servers for anycast communication
    Yu, S
    Zhou, W
    Jia, W
    PDPTA'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS 1-4, 2003, : 1244 - 1250
  • [35] A MDP Approach to Fault-Tolerant Routing
    Pietrabissa, Antonio
    Castrucci, Marco
    Palo, Andi
    EUROPEAN JOURNAL OF CONTROL, 2012, 18 (04) : 334 - 347
  • [36] Considerations for fault-tolerant network on chips
    Ali, M
    Welzl, M
    Zwicknagl, M
    Hellebrand, S
    17TH ICM 2005: 2005 INTERNATIONAL CONFERENCE ON MICROELECTRONICS, PROCEEDINGS, 2005, : 178 - 182
  • [37] A MapReduce system with fault-tolerant mechanism
    Shi, Yi
    Geng, Chen
    Qi, Yong
    Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University, 2014, 48 (02) : 1 - 7
  • [38] A fault-tolerant architecture for ATM networks
    Lo, CC
    Chiou, CY
    COMPUTER COMMUNICATIONS, 1999, 22 (17) : 1540 - 1548
  • [39] Fault-tolerant Sequences of Operation for VAV AHU Systems through Building Performance Simulation
    Torabi, Narges
    Gunay, Burak
    O'Brien, William
    Moromisato, Ricardo
    PROCEEDINGS OF THE 2022 THE 9TH ACM INTERNATIONAL CONFERENCE ON SYSTEMS FOR ENERGY-EFFICIENT BUILDINGS, CITIES, AND TRANSPORTATION, BUILDSYS 2022, 2022, : 21 - 29
  • [40] Fault-Tolerant Parallel Integer Multiplication
    Nissim, Roy
    Schwartz, Oded
    Spiizer, Yuval
    PROCEEDINGS OF THE 36TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, SPAA 2024, 2024, : 207 - 218