Fault tolerant wide-area parallel computing

Cited by: 0
Authors
Weissman, JB [1 ]
Affiliations
[1] Univ Minnesota, Dept Comp Sci & Engn, Minneapolis, MN 55455 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP301 [Theory and methods];
Discipline classification code
081202 ;
Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
Pages: 1214-1225
Page count: 12