CANF: Clustering and anomaly detection method using nearest and farthest neighbor

被引:23
作者
Faroughi, Azadeh [1 ]
Javidan, Reza [1 ]
机构
[1] Shiraz Univ Technol, Comp Engn & IT Dept, Shiraz, Iran
来源
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2018年 / 89卷
关键词
Nearest neighbor density estimator; Farthest neighbor; Subgroups; Anomaly detection; Clustering; Principal component analysis (PCA); ALGORITHM;
D O I
10.1016/j.future.2018.06.031
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Nearest-neighbor density estimators usually do not work well for high dimensional datasets. Moreover, they have high time complexity of O(n(2)) and require high memory usage, especially when indexing is used. These problems impose limitations on applying them for small datasets. In order to overcome these limitations, we proposed a new method called CANF which stands for clustering and anomaly detection using nearest and farthest neighbors. This method calculates distances to nearest and farthest neighbor nodes to create dataset subgroups. Therefore, computational time complexity is of O(n log n) and space complexity is constant. In each iteration of subgroup formations, outlier points of subgroups are detected. After subgroup formation, a proposed assembling technique is used to derive correct clusters. CANF uses a new parameter to detect clusters which are not easily separable. Many experiments on synthetic datasets are carried out to demonstrate the feasibility of CANF. Furthermore, on real-world datasets we compared this algorithm to similar algorithms in anomaly detection task and in clustering task namely LOF and DBSCAN, respectively and the results showed significantly higher accuracy of the CANF, especially in high dimensions. Moreover, to overcome high dimensional datasets problems, Principal Component Analysis (PCA) is used in the clustering method, which preprocesses high-dimensional data. The results showed the effectiveness of the proposed method both for clustering as well as anomaly detection applications. (C) 2018 Elsevier B.V. All rights reserved.
引用
收藏
页码:166 / 177
页数:12
相关论文
共 38 条
[1]  
Achtert E., ADV DATABASES CONCEP, P152
[2]   BoostMap: An embedding method for efficient nearest neighbor retrieval [J].
Athitsos, Vassilis ;
Alon, Jonathan ;
Sclaroff, Stan ;
Kollios, George .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2008, 30 (01) :89-104
[3]  
Bay S.D., 2003, P 9 ACM SIGKDD INT C, P29, DOI [DOI 10.1145/1143844.1143857, 10.1145/956750.956758]
[4]  
Beygelzimer A., 2006, ICML
[5]   LOF: Identifying density-based local outliers [J].
Breunig, MM ;
Kriegel, HP ;
Ng, RT ;
Sander, J .
SIGMOD RECORD, 2000, 29 (02) :93-104
[6]   Robust path-based spectral clustering [J].
Chang, Hong ;
Yeung, Dit-Yan .
PATTERN RECOGNITION, 2008, 41 (01) :191-203
[7]   iLike: Bridging the Semantic Gap in Vertical Image Search by Integrating Text and Visual Features [J].
Chen, Yuxin ;
Sampathkumar, Hariprasad ;
Luo, Bo ;
Chen, Xue-wen .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (10) :2257-2270
[8]   MEAN SHIFT, MODE SEEKING, AND CLUSTERING [J].
CHENG, YZ .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1995, 17 (08) :790-799
[9]  
Ciaccia P, 1997, PROCEEDINGS OF THE TWENTY-THIRD INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, P426
[10]  
Daszykowski M, 2009, COMPREHENSIVE CHEMOMETRICS: CHEMICAL AND BIOCHEMICAL DATA ANALYSIS, VOLS 1-4, pA635