Fault tolerant wide-area parallel computing

Cited by: 0
Authors
Weissman, JB [1]
Affiliations
[1] Univ Minnesota, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
Pages: 1214-1225
Number of pages: 12