Hora: Architecture-aware online failure prediction

Cited by: 38
Authors
Pitakrat, Teerat [1 ]
Okanovic, Dusan [1 ]
van Hoorn, Andre [1 ]
Grunske, Lars [2 ]
Affiliations
[1] Univ Stuttgart, Inst Software Technol, Reliable Software Syst, Stuttgart, Germany
[2] Humboldt Univ, Dept Comp Sci, Software Engn, Berlin, Germany
Keywords
Online failure prediction; Reliability; Component-based software systems; ERROR PROPAGATION; RELIABILITY; MODEL;
DOI
10.1016/j.jss.2017.02.041
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Complex software systems experience failures at runtime even though a lot of effort is put into the development and operation. Reactive approaches detect these failures after they have occurred and already caused serious consequences. In order to execute proactive actions, the goal of online failure prediction is to detect these failures in advance by monitoring the quality of service or the system events. Current failure prediction approaches look at the system or individual components as a monolith without considering the architecture of the system. They disregard the fact that the failure in one component can propagate through the system and cause problems in other components. In this paper, we propose a hierarchical online failure prediction approach, called HORA, which combines component failure predictors with architectural knowledge. The failure propagation is modeled using Bayesian networks which incorporate both prediction results and component dependencies extracted from the architectural models. Our approach is evaluated using Netflix's server-side distributed RSS reader application to predict failures caused by three representative types of faults: memory leak, system overload, and sudden node crash. We compare HORA to a monolithic approach and the results show that our approach can improve the area under the ROC curve by 9.9%. © 2017 The Authors. Published by Elsevier Inc.
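The idea of combining per-component predictions with architectural dependencies can be illustrated with a small sketch. This is not the authors' implementation: it swaps full Bayesian network inference for a simpler noisy-OR combination, and it assumes an acyclic dependency graph and independent failure causes. The function name, the component names, and the propagation weights are all hypothetical.

```python
from typing import Dict, List, Tuple


def propagate_failure_probabilities(
    own_prob: Dict[str, float],
    depends_on: Dict[str, List[Tuple[str, float]]],
) -> Dict[str, float]:
    """Combine per-component failure predictions with dependencies.

    own_prob:   failure probability reported by each component's own predictor.
    depends_on: for each component, (dependency, weight) pairs, where weight is
                the assumed probability that a dependency failure propagates.
    Noisy-OR stand-in for Bayesian network inference (an assumption).
    """
    result: Dict[str, float] = {}
    visiting = set()

    def resolve(component: str) -> float:
        if component in result:
            return result[component]
        if component in visiting:
            raise ValueError("dependency cycle at " + component)
        visiting.add(component)
        # Probability that the component survives its own predicted fault...
        survive = 1.0 - own_prob[component]
        # ...and that no dependency failure propagates into it.
        for dependency, weight in depends_on.get(component, []):
            survive *= 1.0 - weight * resolve(dependency)
        visiting.discard(component)
        result[component] = 1.0 - survive
        return result[component]

    for name in own_prob:
        resolve(name)
    return result


# Hypothetical three-tier system: edge -> middletier -> database.
own = {"database": 0.5, "middletier": 0.1, "edge": 0.0}
deps = {
    "middletier": [("database", 1.0)],
    "edge": [("middletier", 0.5)],
}
probs = propagate_failure_probabilities(own, deps)
```

In this toy setup the edge component has no predicted fault of its own, yet it ends up with a nonzero failure probability because the database's predicted failure propagates upward through the middle tier. A monolithic predictor that looks at each component in isolation would miss this.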
Pages: 669 - 685
Number of pages: 17
Related papers
50 in total
  • [1] An Architecture-aware Approach to Hierarchical Online Failure Prediction
    Pitakrat, Teerat
    Okanovic, Dusan
    van Hoorn, Andre
    Grunske, Lars
    2016 12TH INTERNATIONAL ACM SIGSOFT CONFERENCE ON QUALITY OF SOFTWARE ARCHITECTURES (QOSA), 2016, : 60 - 69
  • [2] A Survey of Online Failure Prediction Methods
    Salfner, Felix
    Lenk, Maren
    Malek, Miroslaw
    ACM COMPUTING SURVEYS, 2010, 42 (03)
  • [3] Online Failure Prediction for Complex Systems: Methodology and Case Studies
    Campos, Joao R.
    Costa, Ernesto
    Vieira, Marco
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2023, 20 (04) : 3520 - 3534
  • [4] Explore unlabeled big data learning to online failure prediction in safety-aware cloud environment
    Zhao, Jia
    Ding, Yan
    Zhai, Yunan
    Jiang, Yuqiang
    Zhai, Yujuan
    Hu, Ming
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 153 : 53 - 63
  • [5] HyperPRAW: Architecture-Aware Hypergraph Restreaming Partition to Improve Performance of Parallel Applications Running on High Performance Computing Systems
    Musoles, Carlos Fernandez
    Coca, Daniel
    Richmond, Paul
    PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
  • [6] Seer: A Lightweight Online Failure Prediction Approach
    Ozcelik, Burcu
    Yilmaz, Cemal
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2016, 42 (01) : 26 - 46
  • [7] BlockHammer: Improving Flash Reliability by Exploiting Process Variation Aware Proactive Failure Prediction
    Ma, Ruixiang
    Wu, Fei
    Lu, Zhonghai
    Zhong, Wenmin
    Wu, Qiulin
    Wan, Jiguang
    Xie, Changsheng
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2020, 39 (12) : 4563 - 4574
  • [8] Increasing Dependability of Component-based Software Systems by Online Failure Prediction
    Pitakrat, Teerat
    van Hoorn, Andre
    Grunske, Lars
    2014 TENTH EUROPEAN DEPENDABLE COMPUTING CONFERENCE (EDCC), 2014, : 78 - 81
  • [9] Online Failure Prediction for Railway Transportation Systems Based on Fuzzy Rules and Data Analysis
    Ding, Zuohua
    Zhou, Yuan
    Pu, Geguang
    Zhou, MengChu
    IEEE TRANSACTIONS ON RELIABILITY, 2018, 67 (03) : 1143 - 1158
  • [10] Architecture-level software performance abstractions for online performance prediction
    Brosig, Fabian
    Huber, Nikolaus
    Kounev, Samuel
    SCIENCE OF COMPUTER PROGRAMMING, 2014, 90 : 71 - 92