Fault tolerant wide-area parallel computing

Author
Weissman, JB [1 ]
Affiliation
[1] Univ Minnesota, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems best suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
Pages: 1214-1225
Page count: 12