An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood

Cited by: 73

Authors:

Ding, Shifei [1]

Du, Mingjing [1]

Sun, Tongfeng [1]

Xu, Xiao [1]

Xue, Yu [2]

Affiliations:

[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Peoples R China

[2] Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing 210044, Jiangsu, Peoples R China

Source:

KNOWLEDGE-BASED SYSTEMS | 2017 / Vol. 133

Funding:

National Natural Science Foundation of China; China Postdoctoral Science Foundation;

Keywords:

Entropy; Density peaks clustering; Mixed type data; Fuzzy neighborhood; SIMILARITY;

DOI:

10.1016/j.knosys.2017.07.027

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most clustering algorithms assume that data contain only numerical values. In practice, however, data sets containing both numerical and categorical attributes are ubiquitous in real-world tasks, and grouping such data effectively is an important yet challenging problem. Most existing algorithms are sensitive to initialization and are generally unsuitable for data with non-spherical distributions. To address this, we propose an entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood (DP-MD-FN). First, we propose a new similarity measure that applies a uniform criterion to both categorical and numerical attributes, which avoids feature transformation and parameter adjustment between categorical and numerical values. We integrate this entropy-based strategy with the density peaks clustering method. In addition, to improve the robustness of the original algorithm, we use a fuzzy neighborhood relation to redefine the local density. Furthermore, to select the cluster centers automatically, we develop a simple determination strategy based on the gamma-graph. The method can handle three types of data: numerical, categorical, and mixed type. We compare the performance of our algorithm with that of traditional clustering algorithms such as K-Modes, K-Prototypes, KL-FCM-GM, EKP and OCIL. Experiments on different benchmark data sets demonstrate the effectiveness and robustness of the proposed algorithm. (C) 2017 Elsevier B.V. All rights reserved.
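The abstract does not give the paper's formulas, so the following is only a rough sketch of the general density peaks idea it builds on, not DP-MD-FN itself. It makes three assumptions that are not from the paper: a Gower-style per-attribute dissimilarity (`mixed_distance`) stands in for the entropy-based similarity measure, a Gaussian kernel stands in for the fuzzy neighborhood definition of local density, and the cluster count `n_clusters` is passed in by hand rather than read off the gamma-graph automatically. All function and parameter names are hypothetical.

```python
import math


def mixed_distance(a, b, numeric_ranges):
    """Stand-in dissimilarity in [0, 1] (assumption, not the paper's measure).

    numeric_ranges maps the index of each numerical attribute to its value
    range; every other attribute is treated as categorical.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_ranges:
            rng = numeric_ranges[j]
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def density_peaks(points, numeric_ranges, d_c, n_clusters):
    """Density peaks clustering sketch; returns (labels, gamma)."""
    n = len(points)
    dist = [[mixed_distance(points[i], points[j], numeric_ranges)
             for j in range(n)] for i in range(n)]

    # Local density rho: Gaussian kernel with cutoff scale d_c, used here as
    # a soft neighborhood in place of the paper's fuzzy neighborhood.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2)
               for j in range(n) if j != i) for i in range(n)]

    # Indices sorted from highest to lowest density.
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta: distance to the nearest point that comes earlier in the density
    # order; parent remembers which point that was.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point gets the largest distance
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # gamma = rho * delta; points with the largest gamma are taken as centers.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(order, key=lambda i: -gamma[i])[:n_clusters]

    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Each remaining point inherits the label of its parent, which has
    # already been labeled because points are visited in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, gamma
```

For example, calling `density_peaks(points, {0: 4.3}, d_c=0.2, n_clusters=2)` on rows such as `(1.0, "red")` and `(5.0, "blue")` treats attribute 0 as numerical with range 4.3 and attribute 1 as categorical. Plotting the sorted `gamma` values gives the kind of gamma-graph the abstract refers to, from which a large gap separates candidate centers from ordinary points.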

Pages: 294-313

Page count: 20
