Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures

Cited by: 66

Authors:

Guan, Qiang ^{[1]}

Fu, Song ^{[1]}

Affiliations:

[1] Univ North Texas, Dept Comp Sci & Engn, Denton, TX 76203 USA

Source:

2013 IEEE 32ND INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2013) | 2013

Keywords:

Cloud computing; Dependable systems; Failure detection; Autonomic management; Learning algorithms;

DOI:

10.1109/SRDS.2013.29

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. These data consist of values of performance metrics for different types of failures, which display different correlations with the performance metrics. In this paper, we present an adaptive anomaly identification mechanism that explores the most relevant principal components of different failure types in cloud computing infrastructures. It integrates the cloud performance metric analysis with filtering techniques to achieve automated, efficient, and accurate anomaly identification. The proposed mechanism adapts itself by recursively learning from the newly verified detection results to refine future detections. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment and by using the Google datacenter traces. Our experimental results show that our mechanism can achieve more efficient and accurate anomaly detection than other existing schemes.

Pages: 205-214

Page count: 10
