Building and using a fault-tolerant MPI implementation

Cited by: 12
Authors
Fagg, GE
Dongarra, JJ
Affiliations
[1] High Performance Comp Ctr Stuttgart, D-70550 Stuttgart, Germany
[2] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Keywords
fault tolerant; message passing; parallel computing; MPI
DOI
10.1177/1094342004046052
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the static process model of the original MPI. FT-MPI lets an application explicitly control the semantics and associated failure modes through modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures internally at multiple levels and how existing applications can use the new features while remaining within the MPI standard.
Pages: 353-361
Page count: 9
Related papers
50 in total
  • [41] A Low-Cost Fault-Tolerant Approach for Hardware Implementation of Artificial Neural Networks
    Ahmadi, A.
    Sargolzaie, M. H.
    Fakhraie, S. M.
    Lucas, C.
    Vakili, Sh.
    2009 INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND TECHNOLOGY, VOL II, PROCEEDINGS, 2009, : 93 - 97
  • [42] Fault-Tolerant Visual Secret Sharing Scheme Using Meaningful Shares
    Chung, Yu-Chun
    Ou, Jia-Hao
    Juan, Justie Su-Tzu
    2019 IEEE 10TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE AND TECHNOLOGY (ICAST 2019), 2019, : 474 - 479
  • [43] Using Imbalance Characteristic for Fault-Tolerant Workflow Scheduling in Cloud Systems
    Yao, Guangshun
    Ding, Yongsheng
    Hao, Kuangrong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (12) : 3671 - 3683
  • [44] Fault-tolerant image filter design using particle swarm optimization
    Bao, Zhiguo
    Wang, Fangfang
    Zhao, Xiaoming
    Watanabe, Takahiro
    ARTIFICIAL LIFE AND ROBOTICS, 2011, 16 (03) : 333 - 337
  • [45] Fault-Tolerant Digital Filters on FPGA using Hardware Redundancy Techniques
    Mallavarapu, Prasanth
    Upadhyay, Har Narayan
    Rajkumar, G.
    Elamaran, V.
    2017 INTERNATIONAL CONFERENCE OF ELECTRONICS, COMMUNICATION AND AEROSPACE TECHNOLOGY (ICECA), VOL 2, 2017, : 256 - 259
  • [46] Fault-tolerant Image Filter Design using Particle Swarm Optimization
    Bao, Zhiguo
    Wang, Fangfang
    Zhao, Xiaoming
    Watanabe, Takahiro
    PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL SYMPOSIUM ON ARTIFICIAL LIFE AND ROBOTICS (AROB 16TH '11), 2011, : 653 - 658
  • [47] Single fault-tolerant distributed shared memory using competitive update
    Kim, JH
    Vaidya, NH
    MICROPROCESSORS AND MICROSYSTEMS, 1997, 21 (03) : 183 - 196
  • [48] Fast and Fault-Tolerant Passive Hyperbolic Localization Using Sensor Consensus
    Gyula, Simon
    Zachar, Gergely
    SENSORS, 2024, 24 (09)
  • [49] An efficient fault-tolerant arithmetic logic unit using a novel fault-tolerant 5-input majority gate in quantum-dot cellular automata
    Ahmadpour, Seyed-Sajad
    Mosleh, Mohammad
    Heikalabad, Saeed Rasouli
    COMPUTERS & ELECTRICAL ENGINEERING, 2020, 82
  • [50] Fault Estimation and Fault-Tolerant Control of Wind Turbines Using the SDW-LSI Algorithm
    Wu, Dinghui
    Liu, Wen
    Song, Jin
    Shen, Yanxia
    IEEE ACCESS, 2016, 4 : 7223 - 7231