On Providing Scalable Self-healing Adaptive Fault-tolerance to RTR SoCs

Cited by: 0
Authors
Navas, Byron [1 ,2 ]
Oberg, Johnny [1 ]
Sander, Ingo [1 ]
Affiliations
[1] KTH Royal Inst Technol, Dept Elect Syst, Stockholm, Sweden
[2] ESPE Univ Fuerzas Armadas, Dept Elect & Elect Engn, Sangolqui, Ecuador
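Source
2014 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG) | 2014
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract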
The dependability of heterogeneous many-core FPGA-based systems is threatened by higher failure rates caused by disruptive scales of integration, increased design complexity, and radiation sensitivity. Triple-modular redundancy (TMR) and run-time reconfiguration (RTR) are traditional fault-tolerant (FT) techniques used to increase dependability. However, hardware redundancy is expensive, and most approaches have poor scalability, flexibility, and programmability. Therefore, innovative solutions are needed to reduce the cost of redundancy while preserving acceptable levels of dependability. In this context, this paper presents the implementation of a self-healing adaptive fault-tolerant SoC that reuses RTR IP-cores to self-assemble different TMR schemes at run-time. The presented system demonstrates the feasibility of the Upset-Fault-Observer concept, which provides a run-time self-test and recovery strategy that delivers fault tolerance for functions accelerated in RTR cores while reducing the redundancy scalability cost by running periodic reconfigurable TMR scan-cycles. In addition, this paper experimentally evaluates the trade-offs of the implemented reconfigurable TMR schemes by characterizing important fault-tolerance metrics, i.e., recovery time (self-repair and self-replicate), detection latency, self-assembly latency, throughput reduction, and increase in physical resources.
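As a rough illustration of the idea of a periodic TMR scan-cycle described in the abstract, the following Python sketch models cores as plain functions and checks them one at a time against two temporary replicas with a majority voter. It is a software analogy under stated assumptions, not the authors' hardware implementation: the names `majority_vote`, `scan_cycle`, `healthy`, and `faulty` are hypothetical, a fault is modeled simply as a function returning a wrong value, and "repair" is modeled as replacing the function, standing in for reconfiguring a core.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical model: a "core" is just a function from an input to an output.
Core = Callable[[int], int]


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the value produced by a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count * 2 > len(outputs) else None


def scan_cycle(cores: List[Core], reference: Core, stimulus: int) -> List[int]:
    """Visit each core in turn, forming a temporary TMR group with two replicas.

    `reference` stands in for a known-good configuration (an assumption of
    this sketch). A core whose output disagrees with the voted result is
    replaced in place. Returns the indices of the cores that were replaced.
    """
    repaired = []
    for index, core in enumerate(cores):
        outputs = [core(stimulus), reference(stimulus), reference(stimulus)]
        voted = majority_vote(outputs)
        if voted is not None and outputs[0] != voted:
            cores[index] = reference  # stand-in for reconfiguring the core
            repaired.append(index)
    return repaired


def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1  # models an upset that corrupts the result


cores = [healthy, faulty, healthy]
repaired = scan_cycle(cores, healthy, stimulus=3)
```

In this toy model only one TMR group exists at any moment, which is the intuition behind scanning cores in turn instead of triplicating every core permanently. Latency, throughput, and resource costs, which the paper measures experimentally, are not modeled here.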
Pages: 6
Related papers
50 in total
  • [1] The Upset-Fault-Observer: A Concept for Self-healing Adaptive Fault Tolerance
    Navas, Byron
    Oberg, Johnny
    Sander, Ingo
    2014 NASA/ESA CONFERENCE ON ADAPTIVE HARDWARE AND SYSTEMS (AHS), 2014, : 89 - 96
  • [2] Fault-tolerance by regeneration: Using development to achieve robust self-healing neural networks
    Federici, D
    PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), VOLS 1-5, 2005, : 2808 - 2813
  • [3] Component-based Self-Healing Algorithm with Dynamic Range Allocation for Fault-Tolerance in WSN
    Begum, Beneyaz A.
    Nandury, Satyanarayana V.
    7TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION TECHNOLOGY (ICCCT - 2017), 2017, : 58 - 65
  • [4] Fault-tolerance Properties and Self-healing Abilities Implementation in FPGA-based Embryonic Hardware Systems
    Szasz, Cs.
    Chindris, V.
    2009 7TH IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS, VOLS 1 AND 2, 2009, : 155 - 160
  • [5] Self-healing and Fault-tolerance Abilities Development in Embryonic Systems Implemented with FPGA-based Hardware
    Szasz, Cs.
    Chindris, V.
    2009 INTERNATIONAL CONFERENCE ON INTELLIGENT ENGINEERING SYSTEMS, 2009, : 196 - 201
  • [6] Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication
    Chtepen, Maria
    Claeys, Filip H. A.
    Dhoedt, Bart
    De Turck, Filip
    Vanrolleghem, Peter A.
    Demeester, Piet
    COMPUTATIONAL SCIENCE - ICCS 2007, PT 1, PROCEEDINGS, 2007, 4487 : 454 - +
  • [7] Self-healing Fault Tolerance Technique in Cloud Datacenter
    Devi, R. Kanniga
    Muthukannan, M.
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES (ICICT 2021), 2021, : 731 - 737
  • [8] Self-healing network for scalable fault tolerant runtime environments
    Angskun, Thara
    Fagg, Graham E.
    Bosilca, George
    Pjesivac-Grbovic, Jelena
    Dongarra, Jack J.
    DISTRIBUTED AND PARALLEL SYSTEMS: FROM CLUSTER TO GRID COMPUTING, 2007, : 73 - 80
  • [9] FAULT TOLERANCE AND SELF-HEALING IN OPTICAL SYSTOLIC ARRAY PROCESSORS
    CAULFIELD, HJ
    PUTNAM, RS
    OPTICAL ENGINEERING, 1985, 24 (01) : 65 - 67
  • [10] Simplifying fault-tolerance: Providing the abstraction of crash failures
    Bazzi, RA
    Neiger, G
    JOURNAL OF THE ACM, 2001, 48 (03) : 499 - 554