On Providing Scalable Self-healing Adaptive Fault-tolerance to RTR SoCs

Cited by: 0

Authors:

Navas, Byron [1,2]

Oberg, Johnny [1]

Sander, Ingo [1]

Affiliations:

[1] KTH Royal Inst Technol, Dept Elect Syst, Stockholm, Sweden

[2] ESPE Univ Fuerzas Armadas, Dept Elect & Elect Engn, Sangolqui, Ecuador

Source:

2014 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG) | 2014

Keywords:

DOI:

Not available

Chinese Library Classification:

TP3 [Computing Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

The dependability of heterogeneous many-core FPGA-based systems is threatened by higher failure rates caused by disruptive scales of integration, increased design complexity, and radiation sensitivity. Triple-modular redundancy (TMR) and run-time reconfiguration (RTR) are traditional fault-tolerant (FT) techniques used to increase dependability. However, hardware redundancy is expensive, and most approaches have poor scalability, flexibility, and programmability. Therefore, innovative solutions are needed to reduce the redundancy cost while still preserving acceptable levels of dependability. In this context, this paper presents the implementation of a self-healing adaptive fault-tolerant SoC that reuses RTR IP-cores in order to self-assemble different TMR schemes at run-time. The presented system demonstrates the feasibility of the Upset-Fault-Observer concept, which provides a run-time self-test and recovery strategy that delivers fault tolerance for functions accelerated in RTR cores, while reducing the scalability cost of redundancy by running periodic reconfigurable TMR scan-cycles. In addition, this paper experimentally evaluates the trade-offs of the implemented reconfigurable TMR schemes by characterizing important fault-tolerance metrics, i.e., recovery time (self-repair and self-replicate), detection latency, self-assembly latency, throughput reduction, and increase in physical resources.
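As background for the TMR technique named in the abstract, the following is a minimal software sketch of a 2-of-3 majority voter with detection of a single disagreeing replica. It is not the paper's hardware implementation; the function names `majority_vote`, `faulty_replica`, and `tmr_step` are hypothetical and chosen only for illustration, and replica outputs are assumed to be plain integers.

```python
from typing import Optional, Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def faulty_replica(a: int, b: int, c: int) -> Optional[int]:
    """Return the index (0, 1, or 2) of the single replica that disagrees
    with the other two, or None if all agree or no two replicas agree."""
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    if b == c:
        return 0
    return None  # all three differ: no single replica can be singled out


def tmr_step(outputs: Tuple[int, int, int]) -> Tuple[int, Optional[int]]:
    """Vote on one set of replica outputs and report a suspect replica.

    In a real system the suspect index would trigger recovery (for
    example, reconfiguring that core); here it is only returned.
    """
    a, b, c = outputs
    return majority_vote(a, b, c), faulty_replica(a, b, c)
```

For example, `tmr_step((3, 9, 3))` masks the faulty middle output and flags replica 1 as the suspect.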


Pages: 6

50 items in total

[1] The Upset-Fault-Observer: A Concept for Self-healing Adaptive Fault Tolerance
Navas, Byron
Oberg, Johnny
Sander, Ingo
2014 NASA/ESA CONFERENCE ON ADAPTIVE HARDWARE AND SYSTEMS (AHS), 2014, : 89 - 96
[2] Fault-tolerance by regeneration: Using development to achieve robust self-healing neural networks
Federici, D
PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), VOLS 1-5, 2005, : 2808 - 2813
[3] Component-based Self-Healing Algorithm with Dynamic Range Allocation for Fault-Tolerance in WSN
Begum, Beneyaz A.
Nandury, Satyanarayana V.
7TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION TECHNOLOGY (ICCCT - 2017), 2017, : 58 - 65
[4] Fault-tolerance Properties and Self-healing Abilities Implementation in FPGA-based Embryonic Hardware Systems
Szasz, Cs.
Chindris, V.
2009 7TH IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS, VOLS 1 AND 2, 2009, : 155 - 160
[5] Self-healing and Fault-tolerance Abilities Development in Embryonic Systems Implemented with FPGA-based Hardware
Szasz, Cs.
Chindris, V.
2009 INTERNATIONAL CONFERENCE ON INTELLIGENT ENGINEERING SYSTEMS, 2009, : 196 - 201
[6] Providing fault-tolerance in unreliable grid systems through adaptive checkpointing and replication
Chtepen, Maria
Claeys, Filip H. A.
Dhoedt, Bart
De Turck, Filip
Vanrolleghem, Peter A.
Demeester, Piet
COMPUTATIONAL SCIENCE - ICCS 2007, PT 1, PROCEEDINGS, 2007, 4487 : 454 - +
[7] Self-healing Fault Tolerance Technique in Cloud Datacenter
Devi, R. Kanniga
Muthukannan, M.
PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES (ICICT 2021), 2021, : 731 - 737
[8] Self-healing network for scalable fault tolerant runtime environments
Angskun, Thara
Fagg, Graham E.
Bosilca, George
Pjesivac-Grbovic, Jelena
Dongarra, Jack J.
DISTRIBUTED AND PARALLEL SYSTEMS: FROM CLUSTER TO GRID COMPUTING, 2007, : 73 - 80
[9] FAULT TOLERANCE AND SELF-HEALING IN OPTICAL SYSTOLIC ARRAY PROCESSORS
CAULFIELD, HJ
PUTNAM, RS
OPTICAL ENGINEERING, 1985, 24 (01) : 65 - 67
[10] Simplifying fault-tolerance: Providing the abstraction of crash failures
Bazzi, RA
Neiger, G
JOURNAL OF THE ACM, 2001, 48 (03) : 499 - 554
