Failure Diagnosis for Cluster Systems using Partial Correlations

Cited by: 3
Authors
Chuah, Edward [1]
Jhumka, Arshad [2]
Alt, Samantha [3]
Evans, R. Todd [4]
Suri, Neeraj [1]
Affiliations
[1] Univ Lancaster, Lancaster LA1 4YW, England
[2] Univ Warwick, Coventry CV4 7AL, W Midlands, England
[3] Intel Corp, Santa Clara, CA 95051 USA
[4] Texas Adv Comp Ctr, Austin, TX 78758 USA
Source
19TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2021) | 2021
Funding
US National Science Foundation; EU Horizon 2020; UK Engineering and Physical Sciences Research Council;
Keywords
HPC systems; Failure Diagnosis; Feature extraction; Partial correlation; Resource use data and system logs; ALGORITHMS; PCA;
DOI
10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, and these are missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow, called WADE, extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers, and (c) provide insights into Lustre I/O and segmentation faults. WADE has been placed in the public domain to support system administrators in failure diagnosis.
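For readers unfamiliar with the term, the following is a minimal sketch of what a partial correlation measures. It is not WADE's implementation, which this record does not describe. The counter names (`load`, `cpu`, `io`) and the data are synthetic assumptions made up for illustration. The sketch regresses two series on a third and correlates the residuals, which shows how conditioning on a shared driver changes the apparent relationship between two counters.

```python
import numpy as np


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    # Design matrix with an intercept column so the residuals are mean-centred.
    design = np.column_stack([np.ones_like(z), z])
    # Residuals of x and y after a least-squares fit on z.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Synthetic, hypothetical counters: both depend on a shared driver `load`.
rng = np.random.default_rng(0)
load = rng.normal(size=500)
cpu = 2.0 * load + rng.normal(scale=0.5, size=500)
io = -1.5 * load + rng.normal(scale=0.5, size=500)

plain = float(np.corrcoef(cpu, io)[0, 1])      # strong, driven by `load`
partial = partial_correlation(cpu, io, load)   # near zero once `load` is held fixed
print(f"plain={plain:.3f} partial={partial:.3f}")
```

In this toy case the plain correlation is strong while the partial correlation is close to zero, because the two counters are related only through `load`. The same calculation can also expose a relationship that a confounding variable masks in the plain correlation.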
Pages: 1091-1101
Page count: 11