Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures

Cited by: 66
Authors
Guan, Qiang [1]
Fu, Song [1]
Affiliations
[1] Univ North Texas, Dept Comp Sci & Engn, Denton, TX 76203 USA
Source
2013 IEEE 32ND INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2013) | 2013
Keywords
Cloud computing; Dependable systems; Failure detection; Autonomic management; Learning algorithms;
DOI
10.1109/SRDS.2013.29
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. These data consist of values of performance metrics for different types of failures, which display different correlations with the performance metrics. In this paper, we present an adaptive anomaly identification mechanism that explores the most relevant principal components of different failure types in cloud computing infrastructures. It integrates the cloud performance metric analysis with filtering techniques to achieve automated, efficient, and accurate anomaly identification. The proposed mechanism adapts itself by recursively learning from the newly verified detection results to refine future detections. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment and by using the Google datacenter traces. Our experimental results show that our mechanism can achieve more efficient and accurate anomaly detection than other existing schemes.
Pages: 205-214
Page count: 10