A Real-Time Trace-Level Root-Cause Diagnosis System in Alibaba Datacenters

Cited by: 14
Authors
Cai, Zhengong [1]
Li, Wei [1]
Zhu, Wanyi [2]
Liu, Lu [1]
Yang, Bowei [1]
Affiliations
[1] Zhejiang Univ, Hangzhou 310012, Zhejiang, Peoples R China
[2] Alibaba Inc, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Real-time systems; Production; Business; Data centers; Measurement; Anomaly detection; Graph-level root-cause analysis; Performance anomaly; Performance profiling and tracing; Relative importance analysis; Graph similarity
DOI
10.1109/ACCESS.2019.2944456
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Root-cause analysis (RCA) for service performance degradation can be a challenging exercise given the increasingly complex, interrelated, distributed infrastructure in today's enterprises. Many approaches have been applied in enterprise datacenters to improve maintenance efficiency. This paper introduces a novel graph-level RCA approach comprising tracing, weighted graph matching, and suspicion ranking. The approach builds on the performance profiling, tracing, and logging systems in Alibaba datacenters to speed up real-time root-cause diagnosis. The system discovers normative patterns and their key graph properties, which are stored and updated offline as a knowledge base. The knowledge base is then used for trace-level risk estimation and for identifying transitions that deviate unexpectedly from the normative patterns. Through testing in production, we show that graph-level RCA is effective at discovering the origins of problems and generating real-time operational support. It substantially reduces the workload of locating the root cause of an anomaly.
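The idea of comparing a trace against stored normative patterns can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract gives no formulas, so the scoring rule below (one minus the fraction of normal traces that contain a transition) is an assumption chosen for simplicity. The service names and the functions `build_knowledge_base` and `rank_suspicious_transitions` are hypothetical as well.

```python
from collections import Counter


def build_knowledge_base(normal_traces):
    """Return, for each caller->callee transition, the fraction of normal traces containing it."""
    edge_counts = Counter()
    for trace in normal_traces:
        # Count each transition at most once per trace.
        edge_counts.update(set(trace))
    return {edge: count / len(normal_traces) for edge, count in edge_counts.items()}


def rank_suspicious_transitions(trace, knowledge_base):
    """Rank the transitions of one trace, most suspicious first.

    Assumed scoring rule (not from the paper): 1.0 means the transition was
    never seen in a normal trace, 0.0 means it appears in every normal trace.
    """
    scores = {edge: 1.0 - knowledge_base.get(edge, 0.0) for edge in set(trace)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical traces, each a list of (caller, callee) transitions.
normal_traces = [
    [("gateway", "order"), ("order", "inventory")],
    [("gateway", "order"), ("order", "inventory")],
    [("gateway", "order"), ("order", "payment")],
    [("gateway", "order"), ("order", "inventory")],
]
kb = build_knowledge_base(normal_traces)

ranking = rank_suspicious_transitions(
    [("gateway", "order"), ("order", "payment"), ("payment", "legacy-db")], kb
)
for edge, score in ranking:
    print(edge, score)
```

Here the knowledge base is built offline from normal traces, and a new trace is scored against it at diagnosis time, which mirrors the offline/real-time split described in the abstract. A production system would presumably weight transitions by latency or call volume rather than by presence alone.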
Pages: 142692-142702
Number of pages: 11