On Providing Scalable Self-healing Adaptive Fault-tolerance to RTR SoCs

Authors
Navas, Byron [1 ,2 ]
Oberg, Johnny [1 ]
Sander, Ingo [1 ]
Affiliations
[1] KTH Royal Inst Technol, Dept Elect Syst, Stockholm, Sweden
[2] ESPE Univ Fuerzas Armadas, Dept Elect & Elect Engn, Sangolqui, Ecuador
Source
2014 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG) | 2014
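Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code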
0812 ;
Abstract
The dependability of heterogeneous many-core FPGA-based systems is threatened by higher failure rates caused by disruptive scales of integration, increased design complexity, and radiation sensitivity. Triple-modular redundancy (TMR) and run-time reconfiguration (RTR) are traditional fault-tolerant (FT) techniques used to increase dependability. However, hardware redundancy is expensive, and most approaches have poor scalability, flexibility, and programmability. Therefore, innovative solutions are needed to reduce the redundancy cost while still preserving acceptable levels of dependability. In this context, this paper presents the implementation of a self-healing adaptive fault-tolerant SoC that reuses RTR IP-cores in order to self-assemble different TMR schemes during run-time. The presented system demonstrates the feasibility of the Upset-Fault-Observer concept, which provides a run-time self-test and recovery strategy that delivers fault-tolerance over functions accelerated in RTR cores, while reducing the redundancy scalability cost by running periodic reconfigurable TMR scan-cycles. In addition, this paper experimentally evaluates the trade-offs of the implemented reconfigurable TMR schemes by characterizing important fault-tolerance metrics, i.e., recovery time (self-repair and self-replicate), detection latency, self-assembly latency, throughput reduction, and increase of physical resources.
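The idea of a periodic TMR scan-cycle mentioned in the abstract can be illustrated with a small software model. The sketch below is not the authors' implementation: it only assumes that each core can be modeled as a function, that two trusted replicas can be instantiated on demand, and that a majority voter flags the outvoted replica. All names (`majority_vote`, `scan_cycle`, `make_replica`) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Core = Callable[[int], int]  # assumption: a core is modeled as a pure function


def majority_vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Return the majority value and the indices of replicas that disagree.

    If no value holds a strict majority, the vote is inconclusive:
    the value is None and every replica is reported as suspect.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]


def scan_cycle(
    cores: Sequence[Core],
    make_replica: Callable[[int], Core],
    test_input: int,
) -> List[int]:
    """Visit each core once, form a temporary TMR triple, report outvoted cores.

    make_replica(idx) is a hypothetical stand-in for configuring a spare
    core with the same function as cores[idx].
    """
    suspects = []
    for idx, core in enumerate(cores):
        triple = [core, make_replica(idx), make_replica(idx)]
        _, faulty = majority_vote([replica(test_input) for replica in triple])
        if 0 in faulty:  # position 0 is the core under observation
            suspects.append(idx)
    return suspects


# Usage: core 1 is deliberately corrupted to imitate an upset.
golden = [lambda x: x + 1, lambda x: x * 2]
cores = [golden[0], lambda x: x * 2 + 1]
suspects = scan_cycle(cores, lambda i: golden[i], test_input=3)
```

In this model only two spare replicas are needed regardless of how many cores are scanned, which is the intuition behind sharing redundancy across cores instead of triplicating every core permanently.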
Pages: 6
Related papers
50 in total
  • [31] Engineering Adaptive Fault-Tolerance Mechanisms for Resilient Computing on ROS
    Lauer, Michael
    Amy, Matthieu
    Fabre, Jean-Charles
    Roy, Matthieu
    Excoffon, William
    Stoicescu, Miruna
    2016 IEEE 17TH INTERNATIONAL SYMPOSIUM ON HIGH ASSURANCE SYSTEMS ENGINEERING (HASE), 2016, : 94 - 101
  • [32] An Adaptive Fault-tolerance Scheme for Distributed Load Balancing Systems
    Liu, Dan
    De Grande, Robson E.
    Boukerche, Azzedine
    48TH ANNUAL SIMULATION SYMPOSIUM (ANSS 2015), 2015, : 138 - 145
  • [33] Dynamic Composite Web Service Execution by Providing Fault-Tolerance and QoS Monitoring
    Angarita, Rafael
    Rukoz, Marta
    Manouvrier, Maude
    SERVICE-ORIENTED COMPUTING - ICSOC 2014 WORKSHOPS, 2015, 8954 : 371 - 377
  • [34] The method providing fault-tolerance for information and control systems of the industrial mechatronic objects
    Melnik, E. V.
    Klimenko, A. B.
    Korobkin, V. V.
    INTERNATIONAL CONFERENCE ON MECHANICAL ENGINEERING, AUTOMATION AND CONTROL SYSTEMS 2016, 2017, 177
  • [35] Adaptive Application Scaling for Improving Fault-Tolerance and Availability in the Cloud
    Radhakrishnan, Ganesan
    BELL LABS TECHNICAL JOURNAL, 2012, 17 (02) : 5 - 14
  • [36] EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
    Chakraborty, Sourav
    Laguna, Ignacio
    Emani, Murali
    Mohror, Kathryn
    Panda, Dhabaleswar K.
    Schulz, Martin
    Subramoni, Hari
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2020, 32 (03):
  • [37] A self-healing mechanism for an intrusion tolerance system
    Park, B
    Park, K
    Kim, S
    TRUST, PRIVACY, AND SECURITY IN DIGITAL BUSINESS, 2005, 3592 : 41 - 49
  • [38] PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance
    Wang, Cheng
    Chen, Xusheng
    Jia, Weiwei
    Li, Boxuan
    Qiu, Haoran
    Zhao, Shixiong
    Cui, Heming
    PROCEEDINGS OF THE 15TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI'18), 2018, : 483 - 499
  • [39] Fault-Tolerance Mechanism for Self-Reconfiguration of Modular Robots
    Bassil, Jad
    Tannoury, Perla
    Piranda, Benoit
    Makhoul, Abdallah
    Bourgeois, Julien
    2022 INTERNATIONAL WIRELESS COMMUNICATIONS AND MOBILE COMPUTING, IWCMC, 2022, : 360 - 365
  • [40] Strong Fault-Tolerance for Self-Assembly with Fuzzy Temperature
    Doty, David
    Patitz, Matthew J.
    Reishus, Dustin
    Schweller, Robert T.
    Summers, Scott M.
    2010 IEEE 51ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, 2010, : 417 - 426