Failure prediction using machine learning in a virtualised HPC system and application

Cited by: 1
Authors
Bashir Mohammed
Irfan Awan
Hassan Ugail
Muhammad Younas
Affiliations
[1] University of Bradford, School of Electrical Engineering and Computer Science
[2] Oxford Brookes University, Department of Computing & Communication Technologies
Source
Cluster Computing, 2019, Vol. 22
Keywords
Failure; Machine learning; High performance computing; Cloud computing
DOI
Not available
Abstract
Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Existing fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective and proactive failure management approach aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future patterns of behaviour makes it possible to predict potential system failure more accurately. Thus, in this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average failure-prediction accuracy of 90% using SVM, which is accurate and effective compared to the other algorithms. This finding implies that our method can effectively predict all possible future system and application failures within the system.
Pages: 471-485
Page count: 14
Related papers
  • [1] Failure prediction using machine learning in a virtualised HPC system and application
    Mohammed, Bashir
    Awan, Irfan
    Ugail, Hassan
    Younas, Muhammad
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (02): : 471 - 485
  • [2] Application of Machine Learning for Dragline Failure Prediction
    Taghizadeh, Amir
    Demirel, Nuray
    1ST SCIENTIFIC PRACTICAL CONFERENCE INTERNATIONAL INNOVATIVE MINING SYMPOSIUM (IN MEMORY OF PROF. VLADIMIR PRONOZA), 2017, 15
  • [3] Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
    Nie, Bin
    Xue, Ji
    Gupta, Saurabh
    Patel, Tirthak
    Engelmann, Christian
    Smirni, Evgenia
    Tiwari, Devesh
    2018 48TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2018, : 95 - 106
  • [4] Desh: Deep Learning for System Health Prediction of Lead Times to Failure in HPC
    Das, Anwesha
    Mueller, Frank
    Siegel, Charles
    Vishnu, Abhinav
    HPDC '18: PROCEEDINGS OF THE 27TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2018, : 40 - 51
  • [5] Online Job Failure Prediction in an HPC System
    Antici, Francesco
    Borghesi, Andrea
    Kiziltan, Zeynep
    EURO-PAR 2023: PARALLEL PROCESSING WORKSHOPS, PT II, EURO-PAR 2023, 2024, 14352 : 167 - 179
  • [6] Prediction of HPC compressive strength based on machine learning
    Jin, Libing
    Duan, Jie
    Jin, Yichen
    Xue, Pengfei
    Zhou, Pin
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [7] Heart Failure Prediction: Machine Learning Application in Critical Care
    Sharma, Himanshu
    Sharma, Gitika
    Sharma, Sachin
    Mishra, Abhijat
    Singh, Avineet
    Sharma, Harshvardhan
    2024 43RD INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, SRDS 2024, 2024, : 361 - 366
  • [8] Failure prediction of turbines using machine learning algorithms
    Kumar, R. Sachin
    Ram, S. Sakthiya
    Jayakar, S. Arun
    Kumar, T. K. Senthil
    MATERIALS TODAY-PROCEEDINGS, 2022, 66 : 1175 - 1182
  • [9] Prediction of creep failure time using machine learning
    Soumyajyoti Biswas
    David Fernandez Castellanos
    Michael Zaiser
    Scientific Reports, 10
  • [10] Prediction of creep failure time using machine learning
    Biswas, Soumyajyoti
    Castellanos, David Fernandez
    Zaiser, Michael
    SCIENTIFIC REPORTS, 2020, 10 (01)