A semiparametric method for clustering mixed data

Cited by: 54
Authors
Foss, Alex [1]
Markatou, Marianthi [1]
Ray, Bonnie [2]
Heching, Aliza [2]
Affiliations
[1] SUNY Buffalo, Dept Biostat, Buffalo, NY 14620 USA
[2] IBM TJ Watson Res Ctr, Yorktown Hts, NY USA
Keywords
Clustering; Unsupervised learning; Mixed data; k-means; Finite mixture models; Big data; DISCRIMINANT-ANALYSIS; MODEL; MIXTURES; SIMILARITY; VARIABLES; ALGORITHM
DOI
10.1007/s10994-016-5575-7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data are generally unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions. We develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses this fundamental problem directly. We study theoretical aspects of our method and demonstrate its effectiveness in a series of Monte Carlo simulation studies and a set of real-world applications.
Pages: 419-458
Page count: 40