Error detection and diagnosis for fault tolerance in distributed systems

Cited by: 5
Authors
Saleh, K [1]
Al-Saqabi, K [1]
Affiliation
[1] Kuwait Univ, Dept Elect & Comp Engn, Safat 13060, Kuwait
Keywords
communications software; detection diagnosis; distributed systems; fault tolerance;
DOI
10.1016/S0950-5849(97)00058-X
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Early error detection, together with an understanding of the nature and conditions of an error's occurrence, can help achieve effective and efficient recovery in distributed systems. Various distributed system extensions have been introduced to implement fault tolerance in distributed software systems. These extensions rely mainly on exchanging contextual information appended to every transmitted application-specific message. Ideally, this information should be usable for checkpointing, error detection, diagnosis, and recovery should a transient failure occur later during execution of the distributed program. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems, such as communication software systems, and show its detection capabilities. Our extension is based on executing a message validity test prior to the transmission of messages and on piggybacking contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. (C) 1998 Elsevier Science B.V.
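The two ideas named in the abstract, a validity test before transmission and contextual information piggybacked on each message, can be illustrated with a minimal sketch. It is not the authors' extension: the abstract does not say what the contextual information or the validity test consists of. The sketch assumes the context is a per-sender sequence number plus the sender's protocol state, and that the validity test is a lookup of which payloads may be sent in the current state. All names (`Envelope`, `Process`, `InvalidMessage`, `allowed`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Envelope:
    """Application message plus piggybacked context (assumed fields)."""
    sender: str
    seq: int      # per-sender sequence number (assumption)
    state: str    # sender's protocol state at send time (assumption)
    payload: str


class InvalidMessage(Exception):
    """Raised when the pre-transmission validity test fails."""


@dataclass
class Process:
    name: str
    state: str
    allowed: Dict[str, Set[str]]  # state -> payloads valid in that state
    next_seq: int = 0
    last_seen: Dict[str, int] = field(default_factory=dict)

    def send(self, payload: str) -> Envelope:
        # Validity test before transmission: reject a payload that the
        # (assumed) protocol table does not allow in the current state.
        if payload not in self.allowed.get(self.state, set()):
            raise InvalidMessage(f"{payload!r} not valid in state {self.state!r}")
        env = Envelope(self.name, self.next_seq, self.state, payload)
        self.next_seq += 1
        return env

    def receive(self, env: Envelope) -> str:
        # Use the piggybacked sequence number to detect and classify
        # anomalies on the channel from env.sender.
        expected = self.last_seen.get(env.sender, -1) + 1
        if env.seq < expected:
            return "duplicate"
        self.last_seen[env.sender] = env.seq
        return "ok" if env.seq == expected else "gap"
```

In this sketch a sender refuses to emit a message that fails its own check, and a receiver classifies each arrival as `"ok"`, `"duplicate"`, or `"gap"` from the piggybacked sequence number alone. A real scheme would likely carry richer context, for example checkpoint identifiers, to support recovery as well as detection.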
Pages: 975-983
Page count: 9
Related papers
50 in total
[21]   Fault tolerance of distributed systems by Information Pattern reconfiguration in the publisher/subscriber communication scheme [J].
Staroswiecki, M. ;
Amani, A. Moradi .
2014 EUROPEAN CONTROL CONFERENCE (ECC), 2014, :1975-1980
[22]   Improving fault tolerance in LinuX container-based distributed systems using blockchain [J].
Farahmandian, Masoum ;
Foumani, Mehdi Farrokhbakht ;
Bayat, Peyman .
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (04) :5285-5294
[23]   Fault Tolerance Model for Hadoop Distributed System [J].
Ahmed, Soraya Setti ;
Slimani, Yahya ;
Frefita, Riadh .
JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2025, 31 (01) :72-92
[24]   Hierarchical error detection in a software implemented fault tolerance (SIFT) environment [J].
Bagchi, S ;
Srinivasan, B ;
Whisnant, K ;
Kalbarczyk, Z ;
Iyer, RK .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2000, 12 (02) :203-224
[25]   Guaranteeing Fault Tolerance in Real Time Systems under Error Bursts [J].
Thomas, Jebin V. ;
Ranjith, R. ;
Pillay, Radhamani V. .
2017 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING, INSTRUMENTATION AND CONTROL TECHNOLOGIES (ICICICT), 2017, :1480-1484
[26]   Software Implemented Fault Detection And Fault Tolerance Mechanisms - PART II: Experimental evaluation of error [J].
Gawkowski, Piotr ;
Sosnowski, Janusz .
INTERNATIONAL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2005, 51 (03) :495-508
[27]   Evaluation of integrated error processing and fault diagnosis in multiprocessor systems [J].
Di Giandomenico, F ;
Chiaradonna, S ;
Bondavalli, A ;
Grandoni, F .
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS I-V, 2000, :1145-1151
[28]   An Adaptive Fault-tolerance Scheme for Distributed Load Balancing Systems [J].
Liu, Dan ;
De Grande, Robson E. ;
Boukerche, Azzedine .
48TH ANNUAL SIMULATION SYMPOSIUM (ANSS 2015), 2015, :138-145
[29]   Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems [J].
Chen, Wuhong ;
Tsai, Jichiang .
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2014, 30 (04) :1167-1186
[30]   Heartbeat Based Error Diagnosis Framework For Distributed Embedded Systems [J].
Mishra, Swagat ;
Khilar, Pabitra Mohan .
FOURTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2011): MACHINE VISION, IMAGE PROCESSING, AND PATTERN ANALYSIS, 2012, 8349