A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Cited by: 0
Authors
Montezanti, Diego [1, 3]
Rucci, Enzo [4]
Rexachs, Dolores [2]
Luque, Emilio [2]
Naiouf, Marcelo [1]
De Giusti, Armando [1, 4]
Affiliations
[1] Univ Nacl La Plata, Sch Comp Sci, III LIDI, Calle 50 & 120, RA-1900 La Plata, Buenos Aires, Argentina
[2] Autonomous Univ Barcelona, Dept Comp Architecture & Operating Syst, E-08193 Barcelona, Spain
[3] Univ Nacl Arturo Jauretche, Florencio Varela, Argentina
[4] Consejo Nacl Invest Cient & Tecn, Buenos Aires, DF, Argentina
Source
JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY | 2014, Vol. 14, No. 1
Keywords
Transient fault; parallel scientific application; soft error detection tool; message content validation;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transient faults are becoming a critical concern in current design trends for general-purpose multiprocessors. Because they can corrupt program outputs, their impact is especially significant for long-running parallel scientific applications, where incorrect results force a costly re-launch of the execution from the beginning. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of messages before they are sent, which prevents errors from propagating to other processes and limits the latency of detection and notification. To assess its utility, the overhead of the SMCV tool is evaluated with three computationally intensive, representative parallel scientific applications. The results obtained demonstrate the efficiency of the SMCV tool in detecting transient fault occurrences.
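The idea of validating message contents across replicas before sending can be illustrated with a minimal Python sketch. This is not the SMCV implementation: the names `validated_send`, `compute_partial_sum`, and `TransientFaultDetected` are hypothetical, the replicas are simulated as plain function calls rather than replicated processes, and the fault is an artificially injected bit flip.

```python
import hashlib
from typing import Callable, Sequence


class TransientFaultDetected(Exception):
    """Raised when replicas disagree on the contents of an outgoing message."""


def digest(payload: bytes) -> str:
    # Hash the message so replicas can be compared by a short fingerprint.
    return hashlib.sha256(payload).hexdigest()


def validated_send(replica_payloads: Sequence[bytes],
                   send: Callable[[bytes], None]) -> None:
    """Send the message only if every replica produced identical contents.

    Assumption for illustration: each replica hands over the bytes it
    intends to send, and `send` stands in for the real communication call.
    """
    if len(replica_payloads) < 2:
        raise ValueError("need at least two replicas to compare")
    reference = digest(replica_payloads[0])
    for payload in replica_payloads[1:]:
        if digest(payload) != reference:
            # The mismatch is caught before the message leaves the sender,
            # so the corrupted value cannot reach other processes.
            raise TransientFaultDetected("replica message contents differ")
    send(replica_payloads[0])


def compute_partial_sum(values, flip_bit: bool = False) -> bytes:
    """Hypothetical replica computation; `flip_bit` simulates a soft error."""
    total = sum(values)
    if flip_bit:
        total ^= 1 << 3  # inject a single-bit corruption into the result
    return total.to_bytes(8, "little", signed=True)


if __name__ == "__main__":
    outbox = []
    data = [1, 2, 3]
    validated_send([compute_partial_sum(data), compute_partial_sum(data)],
                   outbox.append)
    try:
        validated_send([compute_partial_sum(data),
                        compute_partial_sum(data, flip_bit=True)],
                       outbox.append)
    except TransientFaultDetected as exc:
        print("detected:", exc)
    print("messages sent:", len(outbox))
```

In this sketch the comparison happens at the send boundary, so a corrupted value is detected no later than the next outgoing message, which mirrors the abstract's point about stopping propagation and limiting detection latency.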
Pages: 32-38
Page count: 7