Energy-based anomaly detection for mixed data

Cited by: 9
Authors
Do, Kien [1]
Truyen Tran [1]
Venkatesh, Svetha [1]
Affiliations
[1] Deakin Univ, Appl AI Inst, 75 Pigdons Rd, Waurn Ponds, Vic 3216, Australia
Keywords
Mixed data; Mixed-variate restricted Boltzmann machine; Deep belief net; Multilevel anomaly detection; Outlier detection approach
DOI
10.1007/s10115-018-1168-z
CLC classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Anomalies are data points that deviate significantly from the norm. Thus, anomaly detection amounts to finding data points located far away from their neighbors, i.e., those lying in low-density regions. Classic anomaly detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Mixed data poses multiple challenges, including (a) capturing the inter-type correlation structures and (b) measuring deviation from the norm under multiple types. These challenges are exacerbated under (c) high-dimensional regimes. In this paper, we propose a new scalable unsupervised anomaly detection method for mixed data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that estimates the density of mixed data. We propose to use the free energy derived from the Mv.RBM as the anomaly score, as it is identical to the negative log-density of the data up to an additive constant. We then extend this method to detect anomalies across multiple levels of data abstraction, an effective approach for dealing with high-dimensional settings. The extension is dubbed MIXMAD, which stands for MIXed data Multilevel Anomaly Detection. In MIXMAD, we sequentially construct an ensemble of mixed-data Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. Predictions across the ensemble are finally combined via a simple rank aggregation method. The proposed methods are evaluated on a comprehensive suite of synthetic and real high-dimensional datasets. The results demonstrate that for anomaly detection, (a) proper handling of mixed types is necessary, (b) free energy is a powerful anomaly scoring method, (c) multilevel abstraction of data is important for high-dimensional data, and (d) empirically, Mv.RBM and MIXMAD are superior to popular unsupervised detection methods for both homogeneous and mixed data.
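The two scoring ideas in the abstract, free energy as an anomaly score and rank aggregation across detectors, can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: the RBM here is a plain binary RBM rather than the mixed-variate model, its parameters are random rather than trained, the two "detectors" are independent RBMs rather than DBNs of different depths, and the aggregation is a simple average of ranks. The names `free_energy` and `rank_aggregate` are hypothetical.

```python
import numpy as np

def free_energy(V, W, a, b):
    """Free energy of a binary RBM for each row of V.

    F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)),
    so -log p(v) = F(v) + log Z, with log Z the same for every v.
    A higher free energy therefore means a lower density.
    """
    pre_activation = V @ W + b                     # shape (n_samples, n_hidden)
    softplus = np.logaddexp(0.0, pre_activation)   # stable log(1 + exp(x))
    return -(V @ a) - softplus.sum(axis=1)

def rank_aggregate(score_lists):
    """Average per-detector ranks (rank 0 = least anomalous).

    Assumption: plain rank averaging; the abstract only says
    "a simple rank aggregation method".
    """
    ranks = [np.argsort(np.argsort(scores)) for scores in score_lists]
    return np.mean(ranks, axis=0)

# Hypothetical, untrained parameters for two small binary RBMs.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
a = rng.normal(scale=0.5, size=n_visible)
b = rng.normal(scale=0.5, size=n_hidden)
W2 = rng.normal(scale=0.5, size=(n_visible, n_hidden))
a2 = rng.normal(scale=0.5, size=n_visible)
b2 = rng.normal(scale=0.5, size=n_hidden)

# Toy binary data: 8 points with 6 attributes each.
V = rng.integers(0, 2, size=(8, n_visible)).astype(float)

scores_1 = free_energy(V, W, a, b)      # anomaly scores from detector 1
scores_2 = free_energy(V, W2, a2, b2)   # anomaly scores from detector 2
combined = rank_aggregate([scores_1, scores_2])
most_anomalous = int(np.argmax(combined))
```

Because the log-partition term is constant across data points, ranking points by free energy gives the same ordering as ranking them by negative log-density, which is why the intractable normalizer never needs to be computed for scoring.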
Pages: 413-435
Number of pages: 23