Application-level fault tolerance as a complement to system-level fault tolerance

Cited by: 14
Authors
Haines, J. [1]
Lakamraju, V. [1]
Koren, I. [1]
Krishna, C. M. [1]
Affiliation
[1] University of Massachusetts, Department of Electrical and Computer Engineering, Amherst, MA 01003, USA
Keywords
distributed real-time systems; fault tolerance; checkpointing; imprecise computation; target tracking; beam forming
DOI
10.1023/A:1008181429693
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique that is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software, whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level and application-level fault tolerance works best. In many systems, application-level fault tolerance can bridge the gap when system-level fault tolerance alone does not provide the required reliability. We illustrate this with the RTHT target-tracking benchmark and the ABF beamforming benchmark.
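The abstract's central idea, that application-level fault tolerance can absorb failures the system level does not mask in data-parallel workloads, can be illustrated with a small sketch. The Python code below is a hypothetical example and not the authors' RTHT or ABF technique: it splits a signal into partitions, processes each on a simulated worker, and, when a worker fails, degrades gracefully in the spirit of imprecise computation by returning an estimate from the surviving partitions together with a coverage figure. The names `process_partition`, `imprecise_energy`, and `WorkerFailure`, the failure-injection set, and the assumption that signal energy is spread evenly across partitions (which justifies rescaling the partial result) are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence, Set, Tuple


class WorkerFailure(Exception):
    """Hypothetical error signalling that a processing node was lost."""


def process_partition(chunk: Sequence[float], failed: bool) -> float:
    # Stand-in for per-node work in a data-parallel job: sum of squares (energy).
    if failed:
        raise WorkerFailure("node lost")
    return sum(x * x for x in chunk)


def imprecise_energy(
    data: Sequence[float], n_workers: int, failed_workers: Set[int]
) -> Tuple[float, float]:
    """Estimate total signal energy, tolerating lost partitions in the application.

    Returns (estimate, coverage), where coverage is the fraction of samples
    that were actually processed. `failed_workers` injects failures by index.
    """
    if not data or n_workers < 1:
        raise ValueError("need non-empty data and at least one worker")
    size = -(-len(data) // n_workers)  # ceiling division: samples per partition
    chunks = [data[i : i + size] for i in range(0, len(data), size)]
    partial, covered = 0.0, 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            (chunk, pool.submit(process_partition, chunk, idx in failed_workers))
            for idx, chunk in enumerate(chunks)
        ]
        for chunk, fut in futures:
            try:
                # result() re-raises any exception raised inside the worker.
                partial += fut.result()
                covered += len(chunk)
            except WorkerFailure:
                # Application-level recovery: drop the lost partition instead of aborting.
                continue
    if covered == 0:
        raise RuntimeError("no partitions survived; no estimate is possible")
    coverage = covered / len(data)
    # Assumes energy is spread evenly, so rescale the partial sum to the full length.
    return partial / coverage, coverage


if __name__ == "__main__":
    estimate, coverage = imprecise_energy([1.0] * 8, n_workers=4, failed_workers={1})
    print(estimate, coverage)  # 8.0 0.75
```

The design choice here is to trade accuracy for availability: the caller receives a usable, explicitly flagged result instead of an abort. A real system would still depend on system-level mechanisms, such as checkpointing, for failures the application cannot absorb.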
Pages: 53-68
Page count: 16