Graph-based Incident Aggregation for Large-Scale Online Service Systems

Cited by: 12
Authors
Chen, Zhuangbin [1]
Liu, Jinyang [1]
Su, Yuxin [1]
Zhang, Hongyu [2]
Wen, Xuemin [3]
Ling, Xiao [3]
Yang, Yongqiang [3]
Lyu, Michael R. [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Univ Newcastle, Callaghan, NSW, Australia
[3] Huawei, Shenzhen, Peoples R China
Source
2021 36TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING ASE 2021 | 2021
Funding
Australian Research Council
Keywords
Cloud computing; online service systems; incident management; graph representation learning
DOI
10.1109/ASE51524.2021.9678746
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents. Thus, it can be easily employed for online incident aggregation. In particular, to learn the correlations more accurately, we try to recover the complete scope of failures' cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, our framework has been successfully deployed in industrial practice.
Pages: 430-442
Page count: 13