Expected similarity estimation for large-scale batch and streaming anomaly detection

被引:33
作者
Schneider, Markus [1 ,2 ]
Ertel, Wolfgang [2 ]
Ramos, Fabio [3 ]
机构
[1] Univ Ulm, Inst Neural Informat Proc, Ulm, Germany
[2] Univ Appl Sci Ravensburg Weingarten, Inst Artificial Intelligence, Weingarten, Germany
[3] Univ Sydney, Sch Informat Technol, Sydney, NSW, Australia
关键词
Anomaly detection; Large-scale data; Kernel methods; Hilbert space embedding; Mean map; DISTANCE-BASED OUTLIERS; ALGORITHMS; SUPPORT;
D O I
10.1007/s10994-016-5567-7
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We present a novel algorithm for anomaly detection on very large datasets and data streams. Themethod, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state of the art algorithms for anomaly detectionwhile being an order of magnitude faster than most other approaches.
引用
收藏
页码:305 / 333
页数:29
相关论文
共 68 条
[1]   CARDWATCH: A neural network based database mining system for credit card fraud detection [J].
Aleskerov, E ;
Freisleben, B ;
Rao, B .
PROCEEDINGS OF THE IEEE/IAFE 1997 COMPUTATIONAL INTELLIGENCE FOR FINANCIAL ENGINEERING (CIFER), 1997, :220-226
[2]  
Angiulli F., 2002, Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2431), P15
[3]  
Angiulli F., 2007, CIKM, P811, DOI [10.1145/1321440.1321552, DOI 10.1145/1321440.1321552]
[4]   Distance-based outlier queries in data streams: the novel task and algorithms [J].
Angiulli, Fabrizio ;
Fassetti, Fabio .
DATA MINING AND KNOWLEDGE DISCOVERY, 2010, 20 (02) :290-324
[5]  
[Anonymous], 2008, P 14 ACM SIGKDD INT
[6]  
[Anonymous], 2012, P 15 INT C ARTIFICIA
[7]  
Bache K., 2013, UCI Machine Learning Repository
[8]  
Bifet A, 2009, KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, P139
[9]  
Border K., 2006, INFIN DIMENS ANAL QU, DOI [10.1007/3-540-29587-9, DOI 10.1007/3-540-29587-9.]
[10]   LOF: Identifying density-based local outliers [J].
Breunig, MM ;
Kriegel, HP ;
Ng, RT ;
Sander, J .
SIGMOD RECORD, 2000, 29 (02) :93-104