A semiparametric method for clustering mixed data

Cited by: 54
Authors
Foss, Alex [1]
Markatou, Marianthi [1]
Ray, Bonnie [2]
Heching, Aliza [2]
Affiliations
[1] SUNY Buffalo, Dept Biostat, Buffalo, NY 14620 USA
[2] IBM TJ Watson Res Ctr, Yorktown Hts, NY USA
Keywords
Clustering; Unsupervised learning; Mixed data; k-means; Finite mixture models; Big data; DISCRIMINANT-ANALYSIS; MODEL; MIXTURES; SIMILARITY; VARIABLES; ALGORITHM;
DOI
10.1007/s10994-016-5575-7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data are generally unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions. We develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses this fundamental problem directly. We study theoretical aspects of our method and demonstrate its effectiveness in a series of Monte Carlo simulation studies and a set of real-world applications.
Pages: 419-458
Number of pages: 40