A Real-Time Trace-Level Root-Cause Diagnosis System in Alibaba Datacenters

Cited by: 14
Authors
Cai, Zhengong [1]
Li, Wei [1]
Zhu, Wanyi [2]
Liu, Lu [1]
Yang, Bowei [1]
Affiliations
[1] Zhejiang Univ, Hangzhou 310012, Zhejiang, Peoples R China
[2] Alibaba Inc, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Real-time systems; Production; Business; Data centers; Measurement; Anomaly detection; Graph-level root-cause analysis; performance anomaly; performance profiling and tracing; relative importance analysis; graph similarity
DOI
10.1109/ACCESS.2019.2944456
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Root-cause analysis (RCA) for service performance degradation can be challenging given the increasingly complex, interrelated, distributed infrastructure in today's enterprises. Many approaches have been applied in enterprise datacenters to improve maintenance efficiency. This paper introduces a novel graph-level RCA approach comprising tracing, weighted graph matching, and suspicion ranking. The approach builds on the performance profiling, tracing, and logging systems in Alibaba datacenters to speed up real-time root-cause diagnosis. The system discovers normative patterns and their key graph properties, which are stored and updated offline as a knowledge base; this knowledge base is then used for trace-level risk estimation and for identifying transitions that deviate unexpectedly from the normative patterns. Testing in production shows that graph-level RCA is effective at discovering the origins of problems and generating real-time operational support, and that it greatly reduces the workload of locating the root cause of an anomaly.
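The idea of comparing a trace against stored normative patterns and ranking deviations can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not specify the matching or ranking formulas, so the z-score deviation, the fixed penalty for unseen call edges, and every name below (`rank_suspicious_edges`, `baseline`, `unseen_score`, the service names) are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (caller service, callee service); hypothetical representation


def rank_suspicious_edges(
    trace: List[Tuple[str, str, float]],
    baseline: Dict[Edge, Tuple[float, float]],
    unseen_score: float = 10.0,
) -> List[Tuple[Edge, float]]:
    """Rank call edges of one trace by how far they deviate from a baseline.

    trace:    (caller, callee, latency_ms) records for a single request.
    baseline: normative pattern, mapping each known edge to (mean, std) latency.
    unseen_score: assumed fixed score for edges absent from the baseline.
    """
    scores: Dict[Edge, float] = {}
    for caller, callee, latency_ms in trace:
        edge = (caller, callee)
        if edge in baseline:
            mean, std = baseline[edge]
            # Assumed deviation measure: absolute z-score of the latency.
            score = abs(latency_ms - mean) / max(std, 1e-9)
        else:
            # A transition that never appears in the normative pattern.
            score = unseen_score
        # Keep the worst score if the same edge occurs more than once.
        scores[edge] = max(score, scores.get(edge, 0.0))
    # Highest suspicion first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical knowledge base and one observed trace.
baseline = {
    ("gateway", "order"): (20.0, 5.0),
    ("order", "inventory"): (10.0, 2.0),
}
trace = [
    ("gateway", "order", 22.0),     # close to normal
    ("order", "inventory", 40.0),   # much slower than normal
    ("order", "cache", 1.0),        # edge not in the baseline
]
ranking = rank_suspicious_edges(trace, baseline)
for edge, score in ranking:
    print(edge, round(score, 2))
```

In this sketch the slow `order -> inventory` call ranks first, the unseen `order -> cache` transition second, and the near-normal `gateway -> order` call last. A production system would presumably weight edges by graph properties rather than latency alone, which this example does not attempt.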
Pages: 142692-142702
Page count: 11