Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Cited by: 24
Authors
Jauk, David [1]
Yang, Dai [2]
Schulz, Martin [2]
Affiliations
[1] Tech Univ Munich, Dept Informat, Garching, Germany
[2] Tech Univ Munich, Chair Comp Architecture & Parallel Syst, Garching, Germany
Source
PROCEEDINGS OF SC19: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2019
Keywords
High Performance Computing; Fault Prediction; Resilience; Exascale Computing; BAYESIAN SERIAL REVISION; FAILURE PREDICTION; CLUSTER
DOI
10.1145/3295500.3356185
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
Pages: 13