Initial Seed Selection for Mixed Data Using Modified K-means Clustering Algorithm

Cited by: 5
Authors
Sajidha, S. A. [1]
Desikan, Kalyani [2]
Chodnekar, Siddha Prabhu [1]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn, Chennai, Tamil Nadu, India
[2] Vellore Inst Technol, Sch Adv Sci, Chennai, Tamil Nadu, India
Keywords
Initial seed points; K-means; K-prototypes; Clustering; Mixed attributes
DOI
10.1007/s13369-019-04121-0
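Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract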
Data sets to which clustering is applied may be homogeneous (purely numerical or purely categorical) or heterogeneous (both numerical and categorical). Homogeneous data is easier to handle than heterogeneous data. We propose a novel technique for identifying initial seeds for clustering heterogeneous data. It introduces a distance measure in which the distance over numerical attributes is scaled so that it is comparable to the distance over categorical attributes. The proposed initial seed selection algorithm ensures that the initial seed points come from different clusters of the clustering solution; these seeds are then given, together with the data set, as input to the modified K-means clustering algorithm. The technique does not depend on any user-defined parameter and can therefore be applied easily to clusterable data sets with mixed attributes. We have also modified the K-means clustering algorithm to handle mixed attributes: numerical attributes use our distance measure, and categorical attributes are assigned a distance of one when the values differ and zero when they match. Finally, we compare our approach with existing algorithms to bring out its significance, and we perform a statistical test to evaluate the statistical significance of the proposed technique.
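The abstract does not give the scaling formula, so the following is only an illustrative sketch of a mixed-attribute distance of the kind described, not the authors' measure. It assumes each numerical attribute is scaled by its range in the data set, so that one numerical attribute contributes at most 1, the same as a categorical mismatch. The names `attribute_ranges` and `mixed_distance` and the toy records are hypothetical.

```python
def attribute_ranges(data, numeric_idx):
    """Range (max - min) of each numerical attribute over the data set."""
    return {
        j: max(row[j] for row in data) - min(row[j] for row in data)
        for j in numeric_idx
    }


def mixed_distance(a, b, numeric_idx, ranges):
    """Distance between two records with mixed attributes.

    Numerical attributes: |a_j - b_j| divided by the attribute range
    (an ASSUMED scaling, chosen so each term lies in [0, 1]).
    Categorical attributes: 0 if the values match, 1 otherwise.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_idx:
            span = ranges[j]
            # A constant attribute cannot separate records; contribute 0.
            total += abs(x - y) / span if span > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total


# Toy records: two numerical attributes followed by one categorical attribute.
data = [
    (1.0, 10.0, "red"),
    (3.0, 20.0, "red"),
    (5.0, 30.0, "blue"),
]
numeric_idx = {0, 1}
ranges = attribute_ranges(data, numeric_idx)

d01 = mixed_distance(data[0], data[1], numeric_idx, ranges)  # 0.5 + 0.5 + 0
d02 = mixed_distance(data[0], data[2], numeric_idx, ranges)  # 1.0 + 1.0 + 1
print(d01, d02)
```

With this scaling, no single numerical attribute can dominate the categorical terms, which is the property the abstract asks of the distance measure. Other scalings (for example, by standard deviation) would serve the same purpose.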
Pages: 2685-2703
Page count: 19