Design and evaluation of hybrid fault-detection systems

Cited by: 91
Authors
Reis, GA [1]
Chang, J [1]
Vachharajani, N [1]
Mukherjee, SS [1]
Rangan, R [1]
August, DI [1]
Affiliations
[1] Princeton Univ, Dept Elect Engn & Comp Sci, Princeton, NJ 08540 USA
Source
32ND INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS | 2005
DOI
10.1109/ISCA.2005.21
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework that rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
Pages: 148-159
Page count: 12
Related papers
29 in total
  • [1] Baumann R. C., 2001, IEEE Transactions on Device and Materials Reliability, V1, P17, DOI 10.1109/7298.946456
  • [2] BENSO A, 2003, P 9 IEEE INT ON LIN
  • [3] BOSSEN D, 2002, IEEE 2002 REL PHYS S
  • [4] CZECK EW, 1990, P 20 FAULT TOL COMP, P236
  • [5] Transient-fault recovery for chip multiprocessors
    Gomaa, M
    Scarbrough, C
    Vijaykumar, TN
    Pomeranz, I
    [J]. 30TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS, 2003, : 98 - 109
  • [6] Horst R. W., 1990, Proceedings. The 17th Annual International Symposium on Computer Architecture (Cat. No.90CH2887-8), P216, DOI 10.1109/ISCA.1990.134528
  • [7] Soft error sensitivity characterization for microprocessor dependability enhancement strategy
    Kim, S
    Somani, AK
    [J]. INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2002, : 416 - 425
  • [8] CONCURRENT ERROR-DETECTION USING WATCHDOG PROCESSORS - A SURVEY
    MAHMOOD, A
    MCCLUSKEY, EJ
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 1988, 37 (02) : 160 - 174
  • [9] Mukherjee SS, 2003, 36TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, PROCEEDINGS, P29
  • [10] Detailed design and evaluation of Redundant Multithreading alternatives
    Mukherjee, SS
    Kontz, M
    Reinhardt, SK
    [J]. 29TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS, 2002, : 99 - 110