Outlier Detection Forest for Large-Scale Categorical Data Sets

Cited by: 1
Authors
Sun, Zhipeng [1]
Du, Hongwei [1]
Ye, Qiang [2]
Liu, Chuang [1]
Kibenge, Patricia Lilian [2]
Huang, Hui [2]
Li, Yuying [3]
Affiliations
[1] Harbin Inst Technol Shenzhen, Dept Comp Sci & Technol, Shenzhen, Peoples R China
[2] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada
[3] Harbin Inst Technol Shenzhen, Dept Econ & Management, Shenzhen, Peoples R China
Source
COMPUTATIONAL DATA AND SOCIAL NETWORKS | 2019 / Vol. 11917
Funding
National Natural Science Foundation of China
Keywords
Categorical data; Outlier detection; Big data; Entropy; Algorithms
DOI
10.1007/978-3-030-34980-6_4
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Outlier detection is one of the most important data mining problems and has attracted much attention over the past years. A variety of outlier detection schemes have been proposed so far, but most of them work with numeric data sets. These methods cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes that are tailored for categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF can achieve the same level of outlier detection precision and much better scalability.
Pages: 45-56
Page count: 12