Failure prediction using machine learning in a virtualised HPC system and application

Cited by: 1
Authors
Bashir Mohammed
Irfan Awan
Hassan Ugail
Muhammad Younas
Affiliations
[1] University of Bradford, School of Electrical Engineering and Computer Science
[2] Oxford Brookes University, Department of Computing & Communication Technologies
Source
Cluster Computing | 2019 / Vol. 22
Keywords
Failure; Machine learning; High performance computing; Cloud computing
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Existing fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective and proactive failure management approach aimed at minimizing the effect of failure within the system. Machine learning techniques can learn from past information to predict future patterns of behaviour, which makes it possible to predict potential system failure more accurately. In this paper, we therefore explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average failure prediction accuracy of 90% using SVM, which compares favourably with the other algorithms. This finding implies that our method can effectively predict potential future system and application failures within the system.
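To make the idea of time-series failure prediction with lead time concrete, the following is a minimal sketch and not the authors' model. It assumes a single synthetic load metric that ramps up shortly before each failure, turns it into sliding windows labelled by whether a failure occurs within the next few steps, and scores a small k-nearest-neighbours classifier (one of the five algorithm families named in the abstract) on a held-out tail of the trace. All function names, the synthetic trace and the parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def synthetic_trace(n=400, period=40, ramp=5, seed=0):
    """Hypothetical metric: low baseline, high for `ramp` steps before each failure."""
    rng = random.Random(seed)
    series, failures = [], []
    for t in range(n):
        steps_to_failure = (period - 1) - (t % period)
        level = 0.9 if steps_to_failure <= ramp else 0.3
        series.append(level + rng.uniform(-0.05, 0.05))
        failures.append(int(steps_to_failure == 0))
    return series, failures


def make_windows(series, failures, width, lead):
    """Build (window, label) pairs from a metric series.

    The label is 1 if a failure occurs within `lead` steps after the
    window ends, else 0.
    """
    samples = []
    for end in range(width, len(series) - lead + 1):
        window = series[end - width:end]
        label = int(any(failures[end:end + lead]))
        samples.append((window, label))
    return samples


def knn_predict(train, query, k=3):
    """Majority vote among the k training windows closest to `query`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, test, k=3):
    """Fraction of test windows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, window, k) == label for window, label in test)
    return hits / len(test)


def run_demo():
    series, failures = synthetic_trace()
    samples = make_windows(series, failures, width=4, lead=3)
    split = int(0.7 * len(samples))  # chronological split: train on the past only
    return accuracy(samples[:split], samples[split:])


if __name__ == "__main__":
    print(f"held-out accuracy: {run_demo():.3f}")
```

The chronological split matters for time series: shuffling before splitting would let the classifier see windows that overlap the test period. Comparing several classifiers, as the paper does, would amount to swapping `knn_predict` for other models and reporting `accuracy` for each on the same split.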
Pages: 471-485
Number of pages: 14