FAULT-TOLERANT PARALLEL PROGRAMMING IN ARGUS

Cited by: 1
Author
BAL, HE [1]
Affiliation
[1] FREE UNIV AMSTERDAM, DEPT MATH & COMP SCI, 1081 HV AMSTERDAM, NETHERLANDS
Source
CONCURRENCY-PRACTICE AND EXPERIENCE | 1992 / Vol. 4 / No. 1
DOI
10.1002/cpe.4330040104
Chinese Library Classification
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high-performance programs resilient to processor crashes is often too high, given the low probability of such events. If parallel systems become more large-scaled, however, processor failures will become likely, so they should be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault-tolerant transparently. It can log messages, make checkpoints, and so on. Second, the programmer can write explicit code for handling failures in an application-specific way. The latter approach is potentially more efficient, but also requires more work from the programmer. In this paper, we intend to get some initial insight into how hard and efficient explicit fault-tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language supporting parallelism as well as fault tolerance. Our experiences indicate that the extra effort needed for fault tolerance varies much between different applications. Also, trade-offs can frequently be made between programming effort and efficiency. One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. As another result, the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be highly useful.
Pages: 37-55
Page count: 19