RELIABLE DISTRIBUTED SORTING THROUGH THE APPLICATION-ORIENTED FAULT TOLERANCE PARADIGM

被引:6
作者
MCMILLIN, BM [1 ]
NI, LM [1 ]
机构
[1] MICHIGAN STATE UNIV,DEPT COMP SCI,ADV COMP SYST LAB,E LANSING,MI 48824
关键词
APPLICATIONS; BYZANTINE MULTICOMPUTERS; FAIL-STOP; FAULT TOLERANCE; PARALLEL SORTING;
D O I
10.1109/71.149960
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Reliability is an important concern in large scale applications such as distributed sorting. An error made due to a hardware or software fault during the application execution can corrupt, perhaps undetectably, the final result of the calculation. Rather than implement the necessary hardware system level reliability by such traditional (and expensive) methods as replication and masking, we appeal to Application-Oriented Fault Tolerance. The Application-Oriented Fault Tolerance Paradigm relies on properties of expected algorithm behavior to form a test for faulty behavior thus also detecting software failures. The test is performed on the actual application's intermediate values by the peers of a particular tested processor. This paper describes the design and implementation of a reliable version of the distributed bitonic sorting algorithm using this paradigm on commercially available multicomputers.
引用
收藏
页码:411 / 420
页数:10
相关论文
共 9 条
[1]  
ATHAS WC, 1988, IEEE COMPUTER AUG, P9
[2]  
BATCHER KE, 1968 P SPR JOINT COM, V32, P307
[3]  
EVEN S, GRAPH ALGORITHM, P212
[4]  
JOU JY, 1986, P IEEE, V74, P732, DOI 10.1109/PROC.1986.13535
[5]   THE BYZANTINE GENERALS PROBLEM [J].
LAMPORT, L ;
SHOSTAK, R ;
PEASE, M .
ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, 1982, 4 (03) :382-401
[6]  
MCMILLIN B, 1988, 12TH P INT COMPSAC C, P284
[7]  
McMillin B. M., 1989, Parallel Processing for Computer Vision and Display, P190
[8]  
QUINN MJ, 1987, DESIGNING EFFICIENT
[9]  
STUCKI LG, 1977, CURRENT TRENDS PROGR, V2, P80