Imputation Strategies for Clustering Mixed-Type Data with Missing Values

Times cited: 11
Authors
Aschenbruck, Rabea [1]
Szepannek, Gero [1]
Wilhelm, Adalbert F. X. [2]
Affiliations
[1] Univ Appl Sci, Hsch Stralsund, Schwedenschanze 15, D-18435 Stralsund, Germany
[2] Jacobs Univ Bremen, Campus Ring 1, D-28759 Bremen, Germany
Keywords
Clustering; Imputation; Mixed-type data; Missing values; K-MEANS; INITIALIZATION
DOI
10.1007/s00357-022-09422-y
Chinese Library Classification number
O1 [Mathematics]
Subject classification code
0701; 070101
Abstract
Incomplete data sets with different data types are difficult to handle but are regularly found in practical clustering tasks. Therefore, in this paper, two procedures for clustering mixed-type data with missing values are derived and analyzed in a simulation study with respect to the factors of partition, prototypes, imputed values, and cluster assignment. Both approaches are based on the k-prototypes algorithm (an extension of k-means), which is one of the most common clustering methods for mixed-type data (i.e., numerical and categorical variables). For k-means clustering of incomplete data, the k-POD algorithm has recently been proposed, which imputes missing values with the values of the associated cluster center. We derive an adaptation of the latter and additionally present a cluster aggregation strategy after multiple imputation. It turns out that even a simplified and time-saving variant of the presented method can compete with multiple imputation and subsequent pooling.
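The cluster-center imputation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers numerical variables only, and the function name `kpod_sketch`, its parameters, the column-mean starting fill, and the random initialization are assumptions made for illustration.

```python
import numpy as np

def kpod_sketch(X, k, n_iter=20, seed=0):
    """k-POD-style clustering of numeric data with NaNs (illustrative only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: start by filling each missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct rows chosen at random.
    centers = filled[rng.choice(len(filled), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        dists = ((filled[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each center; an empty cluster keeps its previous center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = filled[labels == j].mean(axis=0)
        # Re-impute missing entries with the value of the assigned center.
        filled = np.where(missing, centers[labels], X)
    return labels, centers, filled

# Hypothetical toy data: two groups, two missing entries.
X_demo = np.array([[0.0, 0.0], [0.1, np.nan], [0.0, 0.2],
                   [10.0, 10.0], [10.2, np.nan], [9.9, 10.1]])
labels, centers, filled = kpod_sketch(X_demo, k=2)
```

For mixed-type data, a k-prototypes variant would presumably treat categorical variables through cluster modes and a matching-based dissimilarity rather than means and Euclidean distance. How the paper adapts this step is not stated in the abstract, so the sketch above should be read only as an illustration of the general principle.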
Pages: 2-24
Number of pages: 23