REAL-NUMBER CODES FOR FAULT-TOLERANT MATRIX OPERATIONS ON PROCESSOR ARRAYS

被引:96
作者
NAIR, VSS
ABRAHAM, JA
机构
[1] UNIV TEXAS,COMP ENGN RES CTR,AUSTIN,TX 78758
[2] UNIV TEXAS,DEPT ELECT & COMP ENGN,AUSTIN,TX 78758
基金
美国国家航空航天局;
关键词
Algorithm-based fault tolerance; error detectability; finite-field codes; numerical errors; processor arrays; real-number codes;
D O I
10.1109/12.54836
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to inflexibility of the encoding schemes and also due to potential numerical problems. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this paper, we develop a generalization of the existing schemes as a possible solution to these shortcomings. We prove that linearity is a necessary and sufficient condition for codes used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. We also prove that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error detecting and correcting capabilities. Encoding schemes are given for some of the example codes which fall under the general set of real-number codes. With the help of experiments, we derive a rule of thumb for the selection of a particular code for a given application. The performance overhead of fault tolerance schemes using the generalized encoding schemes is shown to be very low, and this was substantiated through simulation experiments. Since the overall error in the code will also depend on the method of implementation of the coding scheme, we also suggest the use of specific algorithms and special hardware realizations for the check element computation. © 1990 IEEE
引用
收藏
页码:426 / 435
页数:10
相关论文
共 31 条
[1]  
ABRAHAM JA, 1986, SPIE HIGHLY PARALLEL, V614, P49
[2]   FAULT-TOLERANCE - SURVIVAL ATTRIBUTE OF DIGITAL SYSTEMS [J].
AVIZIENIS, A .
PROCEEDINGS OF THE IEEE, 1978, 66 (10) :1109-1125
[3]  
BANERJEE P, 1984, 11TH P ANN S COMP AR
[4]  
BANERJEE P, IN PRESS IEEE T COMP
[5]  
BLAHUT RE, 1984, THEORY PRACTICE ERRO
[6]  
BOSE B, 1982, IEEE T COMPUT, V31, P521, DOI 10.1109/TC.1982.1676034
[7]  
BUTNER SE, 1981, CSL TR211 STANF U DE
[8]  
CHEN CY, 1986, P SPIE ADV ALG ARCH, V696, P228
[9]  
Cohen A.M., 1973, NUMERICAL ANAL
[10]  
CURTIS CW, 1984, LINEAR ALGEBRA