Fast Parallel Outlier Detection for Categorical Datasets using MapReduce

Cited by: 15
Authors
Koufakou, Anna [1]
Secretan, Jimmy [1]
Reeder, John [1]
Cardona, Kelvin [1]
Georgiopoulos, Michael [1]
Affiliations
[1] Univ Cent Florida, Sch EECS, Orlando, FL 32816 USA
Source
2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-8 | 2008
DOI
10.1109/IJCNN.2008.4634266
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection has received considerable attention in many applications, such as detecting network attacks or credit card fraud. The massive datasets currently available for mining in some of these outlier detection applications require large parallel systems, and consequently parallelizable outlier detection methods. Most existing outlier detection methods assume that all of the attributes of a dataset are numerical, usually have a quadratic time complexity with respect to the number of points in the dataset, and quite often they require multiple dataset scans. In this paper, we propose a fast parallel outlier detection strategy based on the Attribute Value Frequency (AVF) approach, a high-speed, scalable outlier detection method for categorical data that is inherently easy to parallelize. Our proposed solution, MR-AVF, is based on the MapReduce paradigm for parallel programming, which offers load balancing and fault tolerance. MR-AVF is particularly simple to develop and it is shown to be highly scalable with respect to the number of cluster nodes.
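The abstract does not spell out the scoring rule, so the following is only a minimal sketch of how an AVF-style computation could be split into map and reduce steps. It assumes the commonly described AVF definition, in which a record's score is the mean frequency of its attribute values across the dataset and the lowest-scoring records are treated as outlier candidates. The function names (`map_counts`, `reduce_counts`, `avf_scores`) and the toy data are hypothetical and are not taken from the paper or from any MapReduce framework.

```python
from collections import Counter
from functools import reduce


def map_counts(chunk):
    """Map step: count (attribute index, value) pairs within one data split."""
    return Counter((j, v) for row in chunk for j, v in enumerate(row))


def reduce_counts(a, b):
    """Reduce step: merge the partial counts produced by two splits."""
    return a + b


def avf_scores(rows, counts):
    """Assumed AVF score: mean dataset-wide frequency of a row's values."""
    return [
        sum(counts[(j, v)] for j, v in enumerate(row)) / len(row)
        for row in rows
    ]


# Toy categorical dataset, already divided into two splits (hypothetical).
splits = [
    [("tcp", "http", "ok"), ("tcp", "http", "ok")],
    [("tcp", "http", "ok"), ("udp", "dns", "fail")],
]

# First pass: per-split counts are computed independently, then merged.
counts = reduce(reduce_counts, map(map_counts, splits), Counter())

# Second pass: score every record against the global counts.
rows = [row for chunk in splits for row in chunk]
scores = avf_scores(rows, counts)

# The record with the lowest score is the strongest outlier candidate.
outlier = min(range(len(rows)), key=scores.__getitem__)
print(rows[outlier], scores[outlier])
```

Because each split is counted independently and the partial counts merge by simple addition, the counting pass parallelizes naturally, which is consistent with the abstract's description of AVF as inherently easy to parallelize.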
Pages: 3298-3304
Page count: 7