Monitoring high-dimensional data for failure detection and localization in large-scale computing systems

Cited by: 22
Authors
Chen, Haifeng [1]
Jiang, Guofei [1]
Yoshihira, Kenji [1]
Affiliations
[1] NEC Lab America Inc, Princeton, NJ 08540 USA
Keywords
failure detection; manifold learning; statistics; data mining; information system; Internet applications
DOI
10.1109/TKDE.2007.190674
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
It is a major challenge to process high-dimensional measurements for failure detection and localization in large-scale computing systems. However, it is observed that in information systems, those measurements are usually located in a low-dimensional structure that is embedded in the high-dimensional space. From this perspective, a novel approach is proposed to model the geometry of underlying data generation and detect anomalies based on that model. We consider both linear and nonlinear data generation models. Two statistics, that is, the Hotelling T² and the squared prediction error (SPE), are used to reflect data variations within and outside the model. We track the probabilistic density of extracted statistics to monitor the system's health. After a failure has been detected, a localization process is also proposed to find the most suspicious attributes related to the failure. Experimental results on both synthetic data and a real e-commerce application demonstrate the effectiveness of our approach in detecting and localizing failures in computing systems.
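The two statistics named in the abstract can be illustrated with a minimal sketch of the linear case. This record contains only the abstract, so the sketch assumes plain PCA as the linear model; it does not reproduce the paper's nonlinear model, its density tracking, or its localization step. The function names `fit_linear_model` and `t2_and_spe`, the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np


def fit_linear_model(X, k):
    """Fit a k-dimensional PCA subspace to normal-operation data X (n x d).

    Assumption: PCA stands in for the "linear data generation model".
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                         # d x k loading matrix
    var = (s[:k] ** 2) / (len(X) - 1)    # variance along each retained direction
    return mean, P, var


def t2_and_spe(x, mean, P, var):
    """Return (T^2, SPE) for one sample or a batch of samples."""
    xc = np.atleast_2d(x) - mean
    scores = xc @ P                           # coordinates inside the model
    t2 = np.sum(scores ** 2 / var, axis=1)    # variation within the model
    residual = xc - scores @ P.T              # part the model cannot explain
    spe = np.sum(residual ** 2, axis=1)       # variation outside the model
    return t2, spe


# Synthetic data: a 2-D latent structure embedded in 10-D, plus small noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 10))
X = Z @ A + 0.01 * rng.normal(size=(500, 10))

mean, P, var = fit_linear_model(X, k=2)
t2_train, spe_train = t2_and_spe(X, mean, P, var)

# A sample pushed off the subspace on one attribute: SPE grows.
x_off = X[0].copy()
x_off[7] += 5.0
t2_off, spe_off = t2_and_spe(x_off, mean, P, var)

# A sample far out along the first principal direction: T^2 grows, SPE stays near zero.
x_in = mean + 10.0 * np.sqrt(var[0]) * P[:, 0]
t2_in, spe_in = t2_and_spe(x_in, mean, P, var)
```

In this sketch, a large SPE signals that a sample has left the learned structure, while a large T² signals an unusual position inside it; thresholds for either statistic would have to be chosen separately.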
Pages: 13-25
Page count: 13