A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems

Cited by: 3
Authors
Chuah, Edward [1]
Jhumka, Arshad [2]
Malek, Miroslaw [3]
Suri, Neeraj [4]
Affiliations
[1] Univ Aberdeen, Comp Sci Dept, Aberdeen AB24 3FX, Scotland
[2] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, England
[3] Univ Lugano, Fac Informat, CH-6904 Lugano, Switzerland
[4] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4YW, England
Keywords
System log-analysis; log-correlation tools; systematic literature review; quality model; failure diagnosis; failure prediction; cluster systems
DOI
10.1109/ACCESS.2022.3231454
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
System logs are the first source of information available to system designers for analyzing and troubleshooting their cluster systems. High-Performance Computing (HPC) systems, for example, generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identifying failures, is losing its validity. System log-analysis tools help system designers gain insight into large volumes of system logs and enable them to perform various analyses (e.g., diagnosing or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of the literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, no prior work has studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes: (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.
Pages: 133487-133503
Number of pages: 17