Log Clustering based Problem Identification for Online Service Systems

Cited by: 309
Authors
Lin, Qingwei [1]
Zhang, Hongyu [1]
Lou, Jian-Guang [1]
Zhang, Yu [2]
Chen, Xuewei [1]
Affiliations
[1] Microsoft Res, Beijing 100080, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Source
2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C) | 2016
Keywords
Logs; Problem Identification; Log Clustering; Diagnosis; Online Service System
DOI
10.1145/2889160.2889232
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Logs play an important role in the maintenance of large-scale online service systems. When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems. Traditionally, engineers run simple keyword searches (such as "error" and "exception") over logs that may be associated with the failures. Such an approach is often time-consuming and error-prone. Through our collaboration with Microsoft service product teams, we propose LogCluster, an approach that clusters the logs to ease log-based problem identification. LogCluster also utilizes a knowledge base to check whether the log sequences have occurred before. Engineers only need to examine a small number of previously unseen, representative log sequences extracted from the clusters to identify a problem, which significantly reduces the number of logs that must be examined while also improving identification accuracy. Through experiments on two Hadoop-based applications and two large-scale Microsoft online service systems, we show that our approach is effective and outperforms the state-of-the-art work proposed by Shang et al. in ICSE 2013. We have successfully applied LogCluster to the maintenance of many actual Microsoft online service systems. In this paper, we also share our success stories and lessons learned.
Pages: 102-111
Page count: 10
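
The workflow outlined in the abstract (cluster the log sequences, take one representative per cluster, and surface only the representatives not already covered by a knowledge base) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the clustering algorithm, or the data representation, so everything below is an assumption made for illustration: log sequences are reduced to sets of hypothetical event IDs, Jaccard similarity and a greedy single-pass clustering are swapped in, and the function names and the 0.5 threshold are invented. It is not the LogCluster implementation.

```python
from typing import FrozenSet, List

# Assumption: a log sequence is reduced to the set of event IDs it contains.
Sequence = FrozenSet[str]


def jaccard(a: Sequence, b: Sequence) -> float:
    """Jaccard similarity between two sets of event IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(sequences: List[Sequence], threshold: float) -> List[List[Sequence]]:
    """Greedy single-pass clustering (illustrative stand-in).

    Each sequence joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters: List[List[Sequence]] = []
    for seq in sequences:
        for members in clusters:
            if jaccard(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def unseen_representatives(
    sequences: List[Sequence],
    knowledge_base: List[Sequence],
    threshold: float = 0.5,
) -> List[Sequence]:
    """Return one representative per cluster, skipping any representative
    that is similar to a sequence already stored in the knowledge base."""
    reps = [members[0] for members in cluster(sequences, threshold)]
    return [
        rep
        for rep in reps
        if all(jaccard(rep, known) < threshold for known in knowledge_base)
    ]


# Hypothetical data: four log sequences and one previously seen pattern.
logs = [
    frozenset({"E1", "E2", "E3"}),
    frozenset({"E1", "E2", "E3", "E4"}),
    frozenset({"E7", "E8"}),
    frozenset({"E7", "E8", "E9"}),
]
knowledge_base = [frozenset({"E1", "E2", "E3"})]

to_review = unseen_representatives(logs, knowledge_base)
print(to_review)
```

In this toy input the four sequences fall into two clusters, and only the representative of the second cluster is left for review, because the first matches an entry in the knowledge base. That mirrors the reduction in examined logs that the abstract describes, although the real system's measures and thresholds may differ.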