Failure prediction using machine learning in a virtualised HPC system and application

Cited by: 1
Authors
Bashir Mohammed
Irfan Awan
Hassan Ugail
Muhammad Younas
Affiliations
[1] University of Bradford, School of Electrical Engineering and Computer Science
[2] Oxford Brookes University, Department of Computing & Communication Technologies
Source
Cluster Computing | 2019 / Vol. 22
Keywords
Failure; Machine learning; High performance computing; Cloud computing
DOI
Not available
Abstract
Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Existing fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective, proactive failure management approach aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future patterns of behaviour makes it possible to predict potential system failure more accurately. In this paper, we therefore explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average failure-prediction accuracy of 90% with SVM, outperforming the other algorithms. This finding suggests that our method can effectively predict future system and application failures within the system.
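The abstract combines time series with a classifier and stresses prediction lead time. The sketch below illustrates that general idea only: it cuts a metric series into sliding windows, labels each window by whether a failure occurs a fixed number of steps later, and classifies new windows with k-nearest neighbours, one of the five algorithms named above. Everything in it is an assumption made for illustration: the function names (`synthetic_trace`, `make_windows`, `knn_predict`, `accuracy`), the `window` and `lead` values, and the perfectly periodic toy data. This record does not describe the paper's actual features, dataset, or pipeline, and the toy accuracy says nothing about the reported 90%.

```python
import math
from collections import Counter


def synthetic_trace(cycles):
    """Hypothetical toy trace: load ramps up, then a failure is recorded."""
    pattern = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.6, 0.8, 1.0]
    failed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    return pattern * cycles, failed * cycles


def make_windows(series, failures, window, lead):
    """Build (features, label) pairs from a metric series.

    features: the `window` most recent readings ending at time t
    label:    1 if a failure is recorded at time t + lead, else 0
    """
    examples = []
    for t in range(window - 1, len(series) - lead):
        features = series[t - window + 1 : t + 1]
        examples.append((features, failures[t + lead]))
    return examples


def knn_predict(train, query, k=3):
    """Majority vote among the k training windows closest to `query`."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test windows whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Train on six cycles, evaluate on three further cycles of the same toy trace.
train_series, train_failures = synthetic_trace(6)
test_series, test_failures = synthetic_trace(3)
train = make_windows(train_series, train_failures, window=3, lead=2)
test_set = make_windows(test_series, test_failures, window=3, lead=2)
print("toy accuracy:", accuracy(train, test_set))
```

Because the label is taken `lead` steps after the window ends, a positive prediction arrives before the failure itself, which is what gives a proactive mechanism time to act. Comparing several classifiers, as the abstract describes, would amount to swapping `knn_predict` for other models trained on the same windows.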
Pages: 471-485
Page count: 14