Scalable Approach to Failure Analysis of High-Performance Computing Systems

Cited by: 2
Authors
Shawky, Doaa [1]
Affiliations
[1] Cairo Univ, Dept Engn Math, Cairo, Egypt
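Keywords
Failure analysis; high-performance computing; rough sets theory; LARGE-SCALE
DOI
10.4218/etrij.14.0113.1133
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract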
Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
Pages: 1023-1031
Number of pages: 9
Related papers
23 in total
[1] [Anonymous], 2012, RAW FAIL DAT
[2] Brandt J., 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), P2, DOI 10.1109/DSNW.2010.5542629
[3] Gibson G., 2007, CTWATCH Q, V3, P4
[4] Hampton J., 1997, J COMPUT INTELL FINA, V5, P25
[5] Hernandez-Diaz AG, 2006, GECCO 2006: GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE, VOL 1 AND 2, P675
[6] Hu X., 2004, FUNDAM INFORM, V59, P125
[7] Performance comparison under failures of MPI and MapReduce: An analytical approach [J]. Jin, Hui; Sun, Xian-He. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2013, 29 (07): 1808-1815
[8] Komorowski J., 1999, Rough Fuzzy Hybridization: A New Trend in Decision Making, P3
[9] *LAB INT DEC SUPP, ROSE2 ROUGH SETS DAT
[10] Fault-Aware Runtime Strategies for High-Performance Computing [J]. Li, Yawei; Lan, Zhiling; Gujrati, Prashasta; Sun, Xian-He. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2009, 20 (04): 460-473