TRADEOFFS IN THE DESIGN OF EFFICIENT ALGORITHM-BASED ERROR-DETECTION SCHEMES FOR HYPERCUBE MULTIPROCESSORS

Cited by: 11
Authors
BALASUBRAMANIAN, V [1]
BANERJEE, P [1]
Affiliation
[1] UNIV ILLINOIS, COORDINATED SCI LAB, URBANA, IL 61801
Keywords
Error coverage experiments; Hypercube multiprocessors; Implementation and evaluation; Parallel algorithms; System level fault detection; Tradeoffs
DOI
10.1109/32.44381
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Numerous algorithms for computationally intensive tasks have been developed by researchers that are suitable for execution on hypercube multiprocessors. One characteristic of many of these algorithms is that they are extremely structured and are tuned for the highest performance to execute on hypercube architectures. In this paper, we have looked at parallel algorithm design from a different perspective. In many cases, it may be possible to redesign the parallel algorithms using software techniques so as to provide a low-cost on-line scheme for hardware error detection without any hardware modifications. This approach is called Algorithm-based error detection. In the past, we have applied algorithm-based techniques for on-line error detection on the hypercube and have reported some preliminary results of one specific implementation on some applications. In this paper, we provide an in-depth study of the various issues and tradeoffs available in Algorithm-based error detection, as well as a general methodology for evaluating the schemes. We have illustrated the approach on an extremely useful computation in the field of numerical linear algebra: QR factorization. We have implemented and investigated numerous ways of applying algorithm-based error detection using different system-level encoding strategies for QR factorization. Different schemes have been observed to result in varying error coverages and time overheads. We have reported the results of our studies performed on a 16 processor Intel iPSC-2/D4/MX hypercube multiprocessor. © 1990 IEEE
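The abstract does not spell out the encoding strategies that were studied, so the following is only a minimal single-process sketch of the general idea, assuming one simple encoding: a row-checksum column appended to the input matrix and carried through a Householder QR reduction. Because each Householder step is a linear transformation applied to every column, the checksum column should still equal the row sums of the reduced matrix at the end, and a mismatch signals an error. The function names, the `fault` parameter, and the tolerance are hypothetical, chosen for illustration, and nothing here models the hypercube data distribution used in the paper.

```python
import math

def qr_with_row_checksum(a, fault=None):
    """Householder-triangularize `a` while carrying a row-checksum column.

    `fault` (hypothetical, for illustration) is (step, row, col, delta):
    after elimination step `step`, `delta` is added to element (row, col)
    to mimic a transient hardware error.
    Returns (r, checksum): the reduced m x n matrix and the carried column.
    """
    m, n = len(a), len(a[0])
    # Encode: column n holds the sum of each row of the input.
    aug = [list(row) + [sum(row)] for row in a]
    for k in range(min(m - 1, n)):
        x = [aug[i][k] for i in range(k, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm > 0.0:
            v = x[:]
            v[0] += math.copysign(norm, x[0])  # sign choice avoids cancellation
            vv = sum(t * t for t in v)
            # Apply H = I - 2 v v^T / (v^T v) to data and checksum columns alike.
            for j in range(k, n + 1):
                dot = sum(v[i] * aug[k + i][j] for i in range(m - k))
                scale = 2.0 * dot / vv
                for i in range(m - k):
                    aug[k + i][j] -= scale * v[i]
        if fault is not None and fault[0] == k:
            _, fi, fj, delta = fault
            aug[fi][fj] += delta  # injected error, not part of the algorithm
    r = [row[:n] for row in aug]
    checksum = [row[n] for row in aug]
    return r, checksum

def checksum_violations(r, checksum, tol=1e-9):
    """Return indices of rows whose entries no longer sum to the checksum."""
    return [i for i, row in enumerate(r) if abs(sum(row) - checksum[i]) > tol]

if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0], [2.0, 3.0, 1.0], [1.0, 2.0, 5.0], [3.0, 1.0, 1.0]]
    r, c = qr_with_row_checksum(a)
    print("fault-free violations:", checksum_violations(r, c))
    r_bad, c_bad = qr_with_row_checksum(a, fault=(0, 2, 1, 0.5))
    print("violations with injected fault:", checksum_violations(r_bad, c_bad))
```

The tolerance matters in practice: the comparison is done in floating point, so a threshold that is too tight reports round-off as errors, while one that is too loose lowers error coverage. This is one instance of the coverage-versus-overhead tradeoff the abstract refers to.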
Pages: 183-196
Page count: 14