Error detection and diagnosis for fault tolerance in distributed systems

Cited by: 5
Authors
Saleh, K [1 ]
Al-Saqabi, K [1 ]
Affiliations
[1] Kuwait Univ, Dept Elect & Comp Engn, Safat 13060, Kuwait
Keywords
communications software; detection diagnosis; distributed systems; fault tolerance;
DOI
10.1016/S0950-5849(97)00058-X
CLC classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Early error detection, together with an understanding of the nature and conditions of an error's occurrence, can make recovery in distributed systems more effective and efficient. Various distributed system extensions have been introduced to implement fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis, and recovery should a transient failure occur later during the execution of the distributed program. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we show its detection capabilities. Our extension is based on executing a message validity test prior to the transmission of messages and on piggybacking contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. © 1998 Elsevier Science B.V.
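The two mechanisms named in the abstract, a validity test before transmission and contextual information piggybacked on each message, can be illustrated with a small sketch. This is not the authors' extension: the record gives no details of it, so the choice of context (a per-sender sequence number plus the sender's view of its peers) and every name below (`Envelope`, `Node`, `send`, `receive`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Envelope:
    """Application message plus piggybacked context (hypothetical layout)."""
    sender: str
    seq: int                   # per-sender sequence number
    last_seen: Dict[str, int]  # highest seq the sender has seen from each peer
    payload: str


class Node:
    """Hypothetical process; not the scheme described in the paper."""

    def __init__(self, name: str, allowed_payloads: Set[str]) -> None:
        self.name = name
        self.allowed = set(allowed_payloads)
        self.next_seq = 0
        self.last_seen: Dict[str, int] = {}

    def send(self, payload: str) -> Envelope:
        # Validity test executed before transmission: reject a payload
        # that this node's (assumed) protocol does not permit.
        if payload not in self.allowed:
            raise ValueError(f"invalid message {payload!r} from {self.name}")
        env = Envelope(self.name, self.next_seq, dict(self.last_seen), payload)
        self.next_seq += 1
        return env

    def receive(self, env: Envelope) -> Tuple[str, List[int]]:
        # Detection and diagnosis from the piggybacked sequence number:
        # "ok" if in order, "gap" with the missing numbers, else "stale".
        expected = self.last_seen.get(env.sender, -1) + 1
        if env.seq < expected:
            return ("stale", [])
        missing = list(range(expected, env.seq))
        self.last_seen[env.sender] = env.seq
        return ("gap", missing) if missing else ("ok", [])


if __name__ == "__main__":
    a = Node("A", {"REQ", "ACK"})
    b = Node("B", {"REQ", "ACK"})
    m0, m1, m2 = a.send("REQ"), a.send("ACK"), a.send("REQ")
    print(b.receive(m0))  # in-order delivery
    print(b.receive(m2))  # m1 was skipped, so the receiver reports a gap
```

In this sketch the sender catches an invalid message locally before it reaches the network, while the receiver uses only the piggybacked context to tell a lost message from a stale one, which is the kind of diagnosis the abstract attributes to contextual information.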
Pages: 975-983
Number of pages: 9