Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems

Cited by: 7
Authors
Lu, Xianglin [1]
Xie, Zhe [2]
Li, Zeyan [1]
Li, Mingjie [1]
Nie, Xiaohui [3]
Zhao, Nengwen [1]
Yu, Qingyang [1]
Zhan, Shenglin [4, 6]
Sui, Kaixin [3]
Zhu, Lin [5]
Pei, Dan [1]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[3] BizSeer, Beijing, Peoples R China
[4] Nankai Univ, Tianjin, Peoples R China
[5] China Mobile Res, Beijing, Peoples R China
[6] Haihe Lab Informat Technol Applicat Innovat HL It, Tianjin, Peoples R China
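Source
2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022) | 2022
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords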
OLTP database systems; performance diagnosis; causal inference;
DOI
10.1109/CCGrid54584.2022.00075
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Online transaction processing (OLTP) database systems provide an effective solution to data support for online applications with high concurrency and low latency. An interruption or performance degradation of OLTP database systems may impact the availability of services and bring substantial economic loss. Thus, diagnosing the issue timely and mitigating it rapidly are essential for database administrators (DBAs). However, performance diagnosis for database systems is challenging due to numerous abnormal metrics, complex failure propagation, and high-performance requirements. Existing works relying on anomaly detection or causal graph construction cannot handle all these challenges simultaneously. In this paper, we propose an unsupervised learning-based method, CauseRank, to perform root cause localization with superior efficiency, high accuracy, and good interpretability. Two key techniques in CauseRank are a novel causal discovery algorithm named Group-based Greedy Equivalent Search (G-GES) incorporated with domain knowledge which treats metric groups as nodes to capture failure propagation and a simple yet effective ranking method named Causal Oriented Personalized PageRank (COPP). Extensive experiments on 97 real-world failure cases collected from a large-scale Oracle database demonstrate the effectiveness of CauseRank, achieving 82.5% top-3 accuracy and 93.8% top-5 accuracy and outperforming baseline approaches. The core idea and framework of CauseRank are generic and can be applied to other large-scale system components.
Pages: 655-664
Number of pages: 10
Related papers
30 in total
[1]  
[Anonymous], 2011, UAI
[2]  
[Anonymous], 2005, P C INN DAT SYST RES
[3]  
Bodík P, 2010, EUROSYS'10: PROCEEDINGS OF THE EUROSYS 2010 CONFERENCE, P111
[4]   Graph-based root cause analysis for service-oriented and microservice architectures [J].
Brandon, Alvaro ;
Sole, Marc ;
Huelamo, Alberto ;
Solans, David ;
Perez, Maria S. ;
Muntes-Mulero, Victor .
JOURNAL OF SYSTEMS AND SOFTWARE, 2020, 159
[5]  
Chen PF, 2014, IEEE INFOCOM SER, P1887, DOI 10.1109/INFOCOM.2014.6848128
[6]  
Chickering D. M., 2003, Journal of Machine Learning Research, V3, P507, DOI 10.1162/153244303321897717
[7]  
Cornejo R., 2018, DYNAMIC ORACLE PERFO, P61
[8]  
Dittrich K. R., 1995, P RUL DAT SYST
[9]  
Dogga P., 2021, ARXIV
[10]  
Jeh G., 2003, P 12 INT C WORLD WID