A novel density peaks clustering algorithm for mixed data

Cited by: 46
Authors
Du, Mingjing [1]
Ding, Shifei [1, 2]
Xue, Yu [3]
Affiliations
[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing 210044, Jiangsu, Peoples R China
Keywords
Data clustering; Density peaks; Entropy; Mixed data; Similarity
DOI
10.1016/j.patrec.2017.07.001
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The density peaks clustering (DPC) algorithm is well known for its effectiveness on data sets with non-spherical distributions. However, it works only on numerical values, which prevents it from being used to cluster real-world data containing both categorical and numerical values. Traditional clustering algorithms for mixed data rely on pre-processing based on binary encoding, but such methods destroy the original structure of categorical attributes. Other solutions based on simple matching, such as K-Prototypes, need a user-defined parameter to avoid favoring either type of attribute. To overcome these problems, we present a novel clustering algorithm for mixed data, called DPC-MD. We improve DPC by using a new similarity criterion that handles three types of data: numerical, categorical, and mixed. Compared to other methods for mixed data, DPC has clear advantages in dealing with non-spherically distributed data. In addition, the core of the proposed method is a new similarity measure for mixed data, designed to avoid feature transformation and parameter adjustment. The performance of our method is demonstrated by experiments on several real-world datasets, in comparison with traditional clustering algorithms such as K-Modes, K-Prototypes, EKP, and SBAC. (C) 2017 Elsevier B.V. All rights reserved.
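The general idea of running density peaks clustering on a mixed-data distance matrix can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the DPC-MD similarity measure, so the sketch substitutes a plain Gower-style distance (range-normalised numerical differences plus simple matching on categories). The function names `mixed_distance` and `density_peaks`, the cutoff `d_c`, and the toy data are all hypothetical.

```python
import numpy as np

def mixed_distance(num, cat):
    """Gower-style pairwise distance (a stand-in, NOT the paper's measure).

    num: (n, p) float array of numerical attributes
    cat: (n, q) integer-coded array of categorical attributes
    """
    p, q = num.shape[1], cat.shape[1]
    # Range-normalise each numerical attribute so differences lie in [0, 1].
    span = num.max(axis=0) - num.min(axis=0)
    span[span == 0] = 1.0
    num_part = (np.abs(num[:, None, :] - num[None, :, :]) / span).sum(axis=2)
    # Simple matching on categorical attributes: 1 per mismatching attribute.
    cat_part = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)
    return (num_part + cat_part) / (p + q)

def density_peaks(dist, d_c, n_clusters):
    """Basic DPC on a precomputed distance matrix (cutoff-kernel density)."""
    n = dist.shape[0]
    # Local density rho: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from densest to sparsest; ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour.
            delta[i] = dist.max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]   # distance to the nearest denser point
        parent[i] = j
    # Centres are the points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf    # always keep the densest point as a centre
    centres = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centres] = np.arange(n_clusters)
    # Every other point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centres

# Toy mixed data: one numerical and one categorical attribute, two groups.
num = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
cat = np.array([[0], [0], [0], [1], [1], [1]])
dist = mixed_distance(num, cat)
labels, centres = density_peaks(dist, d_c=0.3, n_clusters=2)
print(labels)
```

Because DPC only needs pairwise distances, the Gower-style function could be swapped for any other mixed-data measure without changing `density_peaks`. Note that the stand-in distance weights every attribute equally, which is a simplification and not a claim about the paper's method.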
Pages: 46-53
Page count: 8