Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Cited by: 9

Authors:

Yajnik, S [1]

Jha, NK [1]

Affiliations:

[1] PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 1997 / Vol. 8 / No. 02

Keywords:

algorithm-based fault tolerance; concurrent error detection; concurrent fault location; fault diagnosis; graceful degradation; transient faults;

DOI:

10.1109/71.577256

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202 ;

Abstract:

Algorithm-based fault tolerance (ABFT) is a technique which improves the reliability of a multiprocessor system by providing concurrent error detection and fault location capability to it. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work done till now in this area takes care of only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. There exists no previous deterministic ABFT method for the design of general t-fault locating systems, even for the case of t = 1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check-computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system.
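
As a rough illustration of what "encoding data at the system level" can mean, the sketch below uses the classic row/column checksum encoding for matrix multiplication. It is not the design method proposed in this paper: the function names are hypothetical, each product element is assumed to be computed by its own processor, and faults in the check-computing processors (which the paper's extended model covers) are not modeled.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum_row(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum_column(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Locate a single corrupted data element of a full-checksum product.

    Returns None if every check passes, or (row, col) if exactly one row
    check and one column check fail. Any other pattern is reported as not
    locatable by this simple scheme.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern not locatable by single-error checks")


# The product of the encoded operands carries its own row and column checksums.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(add_column_checksum_row(a), add_row_checksum_column(b))
fault_free = locate_error(c_full)   # all checks pass

c_full[1][0] += 5                   # simulate a fault in one data element
located = locate_error(c_full)      # the failing row and column checks intersect at (1, 0)
```

Integer data keeps the checksum comparisons exact; with floating-point data the equality tests would need a tolerance.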

Pages: 137 - 153

Number of pages: 17

50 entries in total

[1] Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model
Yajnik, S
Jha, NK
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1997, 8 (07) : 757 - 768
[2] DESIGN OF ALGORITHM-BASED FAULT-TOLERANT MULTIPROCESSOR SYSTEMS FOR CONCURRENT ERROR-DETECTION AND FAULT-DIAGNOSIS
VINNAKOTA, B
JHA, NK
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1994, 5 (10) : 1099 - 1106
[3] ALGORITHM-BASED FAULT TOLERANCE ON A HYPERCUBE MULTIPROCESSOR
BANERJEE, P
RAHMEH, JT
STUNKEL, C
NAIR, VS
ROY, K
BALASUBRAMANIAN, V
ABRAHAM, JA
IEEE TRANSACTIONS ON COMPUTERS, 1990, 39 (09) : 1132 - 1145
[4] Algorithm-based fault location and recovery for matrix computations on multiprocessor systems
RoyChowdhury, A
Banerjee, P
IEEE TRANSACTIONS ON COMPUTERS, 1996, 45 (11) : 1239 - 1247
[5] DIAGNOSABILITY AND DIAGNOSIS OF ALGORITHM-BASED FAULT-TOLERANT SYSTEMS
VINNAKOTA, B
JHA, NK
IEEE TRANSACTIONS ON COMPUTERS, 1993, 42 (08) : 924 - 937
[6] An algorithm for optimization of reconfiguration of fault tolerant multiprocessor systems
Janik, P
Kotocova, M
PROCEEDINGS OF THE SIXTH EUROMICRO WORKSHOP ON PARALLEL AND DISTRIBUTED PROCESSING - PDP '98, 1998, : 342 - 348
[7] SYNTHESIS OF ALGORITHM-BASED FAULT-TOLERANT SYSTEMS FOR DEPENDENCE GRAPHS
VINNAKOTA, B
JHA, NK
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1993, 4 (08) : 864 - 874
[8] Automation of fault-tolerant graceful degradation
Yiyan Lin
Sandeep Kulkarni
Arshad Jhumka
Distributed Computing, 2019, 32 : 1 - 25
[9] Automation of fault-tolerant graceful degradation
Lin, Yiyan
Kulkarni, Sandeep
Jhumka, Arshad
DISTRIBUTED COMPUTING, 2019, 32 (01) : 1 - 25
[10] Low-cost fault tolerance in evolvable multiprocessor systems: a graceful degradation approach
Shervin VAKILI
Sied Mehdi FAKHRAIE
Siamak MOHAMMADI
Ali AHMADI
Journal of Zhejiang University (Science A: An International Applied Physics & Engineering Journal), 2009, 10 (06) : 922 - 926
