Fault tolerant wide-area parallel computing

Authors
Weissman, JB [1]
Affiliation
[1] Univ Minnesota, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
Pages: 1214 - 1225
Page count: 12