Incremental Causal Graph Learning for Online Root Cause Analysis

Cited by: 11
Authors
Wang, Dongjie [1]
Chen, Zhengzhang [2]
Fu, Yanjie [1]
Liu, Yanchi [2]
Chen, Haifeng [2]
Affiliations
[1] Univ Cent Florida, Orlando, FL 32816 USA
[2] NEC Labs Amer Inc, Princeton, NJ 08540 USA
Source
PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023 | 2023
Funding
National Science Foundation (USA);
Keywords
Root Cause Analysis; AIOps; Causal Discovery; Trigger Point Detection; Incremental Learning; Disentangled Graph Learning; MODEL;
DOI
10.1145/3580305.3599392
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The task of root cause analysis (RCA) is to identify the root causes of system faults or failures by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure recovery and mitigate system damage and financial losses. However, previous research has mostly focused on offline RCA algorithms, which often require the RCA process to be initiated manually, need a significant amount of time and data to train a robust model, and must be retrained from scratch for each new system fault. In this paper, we propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model. CORAL consists of three components: Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component aims to detect system state transitions automatically and in near real time. To achieve this, we develop an online trigger point detection approach based on multivariate singular spectrum analysis and cumulative sum statistics. To efficiently update the RCA model, we propose an incremental disentangled causal graph learning approach that decouples state-invariant and state-dependent information. After that, CORAL applies a random walk with restarts to the updated causal graph to accurately identify root causes. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets demonstrate the effectiveness and superiority of the proposed framework.
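The random walk with restarts mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not CORAL's implementation: the function name, the toy graph, the node labels, the restart probability of 0.15, and the handling of nodes without outgoing edges are all assumptions made for illustration. The walker starts at a seed node (here a hypothetical KPI node), follows weighted edges, and jumps back to the seed with a fixed probability; the stationary visiting probabilities are then used to rank candidate nodes.

```python
def random_walk_with_restart(adj, restart, restart_prob=0.15,
                             tol=1e-10, max_iter=1000):
    """Return stationary visiting probabilities of a random walk with restart.

    adj[i][j] is the non-negative weight of stepping from node i to node j.
    restart is a probability distribution over nodes (the seed distribution).
    All names and defaults here are illustrative assumptions.
    """
    n = len(adj)
    # Row-normalize the weights into transition probabilities.
    trans = []
    for row in adj:
        total = sum(row)
        if total == 0:
            # Assumption: a node with no outgoing edges sends the walker
            # back to the restart distribution.
            trans.append(list(restart))
        else:
            trans.append([w / total for w in row])

    scores = list(restart)
    for _ in range(max_iter):
        new_scores = [
            restart_prob * restart[j]
            + (1 - restart_prob) * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical toy graph: node 0 is the KPI (seed), nodes 1-3 are candidates.
# The walker steps from the KPI toward candidates with the given weights.
labels = ["kpi", "service_a", "service_b", "unrelated"]
adj = [
    [0.0, 0.8, 0.2, 0.0],  # kpi -> service_a (0.8), kpi -> service_b (0.2)
    [0.0, 0.0, 0.0, 0.0],  # service_a has no outgoing edges
    [0.0, 0.0, 0.0, 0.0],  # service_b has no outgoing edges
    [0.0, 0.0, 0.0, 0.0],  # unrelated is never reached from the seed
]
restart = [1.0, 0.0, 0.0, 0.0]

scores = random_walk_with_restart(adj, restart)
# Rank candidates (everything except the seed) by visiting probability.
ranking = sorted(range(1, len(labels)), key=lambda i: scores[i], reverse=True)
print([labels[i] for i in ranking])
```

In this toy setting the more strongly connected candidate ranks first and the unreachable node receives a score of zero. How edge weights and walk directions are derived from a learned causal graph is specific to the paper and is not reproduced here.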
Pages: 2269-2278
Page count: 10