Expected similarity estimation for large-scale batch and streaming anomaly detection

Authors
Markus Schneider
Wolfgang Ertel
Fabio Ramos
Affiliations
[1] University of Ulm, Institute of Neural Information Processing
[2] University of Applied Sciences Ravensburg-Weingarten, Institute for Artificial Intelligence
[3] The University of Sydney, School of Information Technologies
Source
Machine Learning | 2016 / Volume 105
Keywords
Anomaly detection; Large-scale data; Kernel methods; Hilbert space embedding; Mean map;
DOI
Not available
Abstract
We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.
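To illustrate the core idea summarized in the abstract, the following is a minimal, hypothetical Python sketch of expected similarity estimation via an approximate kernel mean map. It assumes a random Fourier feature approximation of an RBF kernel and a simple exponential-decay update as one possible way to forget old data under concept drift; the class name, parameters, and the particular approximation are illustrative and not taken from the paper.

```python
import numpy as np

class ExposeSketch:
    """Illustrative expected-similarity scorer built on a kernel mean map.

    Assumes a random Fourier feature approximation of an RBF kernel, so the
    mean map is a single D-dimensional vector (constant memory).
    """

    def __init__(self, n_features, d_input, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
        self.W = rng.normal(scale=1.0 / sigma, size=(n_features, d_input))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.mu = np.zeros(n_features)  # empirical kernel mean map
        self.n = 0

    def _phi(self, X):
        # Approximate feature map: phi(x) = sqrt(2/D) * cos(W x + b)
        return self.scale * np.cos(X @ self.W.T + self.b)

    def fit_batch(self, X):
        # Batch estimate mu = (1/n) * sum_i phi(x_i): one pass, linear in n.
        self.mu = self._phi(X).mean(axis=0)
        self.n = X.shape[0]
        return self

    def update_online(self, x, decay=None):
        # Incremental update in constant time per instance. A decay in (0, 1)
        # gradually forgets old observations (one simple drift-adaptation choice).
        phi_x = self._phi(x[None, :])[0]
        if decay is None:
            self.n += 1
            self.mu += (phi_x - self.mu) / self.n
        else:
            self.mu = decay * self.mu + (1.0 - decay) * phi_x

    def score(self, X):
        # Expected similarity to the distribution of regular data; constant
        # time per query point. Low scores indicate likely anomalies.
        return self._phi(X) @ self.mu


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    regular = rng.normal(size=(5000, 3))
    model = ExposeSketch(n_features=256, d_input=3, sigma=1.0).fit_batch(regular)
    queries = np.array([[0.0, 0.0, 0.0], [8.0, 8.0, 8.0]])
    print(model.score(queries))  # the inlier scores noticeably higher than the outlier
```

In this sketch the score of a query point is its approximate average kernel similarity to the training data, so inliers receive higher scores than outliers and thresholding the scores yields an anomaly detector.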
Pages: 305–333
Number of pages: 28