ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors

Cited by: 90
Authors
Prvulovic, M [1]
Zhang, Z [1]
Torrellas, J [1]
Affiliations
[1] Univ Illinois, Hewlett Packard Labs, Urbana, IL 60680 USA
Source
29TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS | 2002
DOI
10.1109/ISCA.2002.1003567
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.
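To illustrate the memory-based distributed parity idea mentioned in the abstract, here is a minimal sketch, not the paper's hardware design. It assumes a hypothetical parity group of equal-sized node memories whose parity is the bytewise XOR of all members, and shows that the contents of one lost node can be rebuilt from the surviving nodes plus the parity. The function names `xor_blocks` and `rebuild_lost` and the sample data are assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_lost(surviving, parity):
    """Rebuild the single missing block of a parity group (assumed XOR parity)."""
    return xor_blocks(list(surviving) + [parity])


# Hypothetical memory contents of three nodes in one parity group.
nodes = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(nodes)

# Simulate the permanent loss of node 1 and rebuild it from the rest.
recovered = rebuild_lost([nodes[0], nodes[2]], parity)
assert recovered == nodes[1]
```

In this sketch the parity must be kept up to date on every memory write, which is the kind of frequent background operation the abstract assigns to specialized hardware.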
Pages: 111-122
Number of pages: 12
Related papers
34 in total
[1]   AHMED RE, 1990, P 20 INT S FAULT TOL, P82
[2]   [Anonymous], 5 CAECW FEB
[3]   DIVA: A reliable substrate for deep submicron microarchitecture design [J].
Austin, TM.
32ND ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, (MICRO-32), PROCEEDINGS, 1999: 196-207
[4]   An architecture for tolerating processor failures in shared-memory multiprocessors [J].
Banatre, M;
Gefflaut, A;
Joubert, P;
Morin, C;
Lee, PA.
IEEE TRANSACTIONS ON COMPUTERS, 1996, 45 (10): 1101-1115
[5]   BANATRE M, 1990, P 20 S FAULT TOL COM, P89
[6]   MANETHO - TRANSPARENT ROLLBACK-RECOVERY WITH LOW OVERHEAD, LIMITED ROLLBACK, AND FAST OUTPUT COMMIT [J].
ELNOZAHY, EN;
ZWAENEPOEL, W.
IEEE TRANSACTIONS ON COMPUTERS, 1992, 41 (05): 526-531
[7]   ELNOZAHY M, 1999, CMUCS99148
[8]   KERMARREC AM, 1995, DIG PAP INT SYMP FAU, P289, DOI 10.1109/FTCS.1995.466970
[9]   A direct-execution framework for fast and accurate simulation of superscalar processors [J].
Krishnan, V;
Torrellas, J.
1998 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PROCEEDINGS, 1998: 286-293
[10]   KUFRIN R, 1999, BARRIER SYNCHRONIZAT