Failure prediction using machine learning in a virtualised HPC system and application

Cited by: 1
Authors
Bashir Mohammed
Irfan Awan
Hassan Ugail
Muhammad Younas
Affiliations
[1] University of Bradford, School of Electrical Engineering and Computer Science
[2] Oxford Brookes University, Department of Computing & Communication Technologies
Source
Cluster Computing | 2019 / Vol. 22, No. 2
Keywords
Failure; Machine learning; High performance computing; Cloud computing
Abstract
Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Existing fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective and proactive failure management approach aimed at minimizing the effect of failure within the system. Machine learning techniques can learn from past information to predict future patterns of behaviour, which makes it possible to predict potential system failures more accurately. In this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model achieves an average failure prediction accuracy of 90% with SVM, which compares favourably with the other algorithms. This finding suggests that our method can effectively predict future system and application failures within the system.
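The comparison-based test described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn is available, uses a synthetic imbalanced dataset in place of real HPC failure logs, relies on default hyperparameters, and ignores the time-series structure of the data (ordinary k-fold cross-validation would leak future information on real logs). The function name `compare_classifiers` and the dataset parameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, folds=5):
    """Return the mean cross-validated accuracy for each of the five algorithms."""
    models = {
        # SVM and KNN are distance-based, so features are standardised first.
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        # scikit-learn's decision tree is an optimised CART variant.
        "CART": DecisionTreeClassifier(random_state=0),
        "LDA": LinearDiscriminantAnalysis(),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in for labelled failure data: about 20% "failure" samples.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=0,
)

scores = compare_classifiers(X, y)
for name, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {accuracy:.3f}")
```

On imbalanced failure data, accuracy alone can be misleading, so in practice one would likely also report recall or F1 for the failure class; the sketch keeps accuracy only because that is the metric the abstract reports.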
Pages: 471-485
Page count: 14