Imputation Strategies for Clustering Mixed-Type Data with Missing Values

Cited by: 11

Authors:

Aschenbruck, Rabea [1]

Szepannek, Gero [1]

Wilhelm, Adalbert F. X. [2]

Affiliations:

[1] Univ Appl Sci, Hsch Stralsund, Schwedenschanze 15, D-18435 Stralsund, Germany

[2] Jacobs Univ Bremen, Campus Ring 1, D-28759 Bremen, Germany

Source:

JOURNAL OF CLASSIFICATION | 2023 / Vol. 40 / Issue 01

Keywords:

Clustering; Imputation; Mixed-type data; Missing values; K-MEANS; INITIALIZATION;

DOI:

10.1007/s00357-022-09422-y

Chinese Library Classification (CLC) number:

O1 [Mathematics];

Discipline classification code:

0701 ; 070101 ;

Abstract:

Incomplete data sets with different data types are difficult to handle, but regularly to be found in practical clustering tasks. Therefore in this paper, two procedures for clustering mixed-type data with missing values are derived and analyzed in a simulation study with respect to the factors of partition, prototypes, imputed values, and cluster assignment. Both approaches are based on the k-prototypes algorithm (an extension of k-means), which is one of the most common clustering methods for mixed-type data (i.e., numerical and categorical variables). For k-means clustering of incomplete data, the k-POD algorithm recently has been proposed, which imputes the missings with values of the associated cluster center. We derive an adaptation of the latter and additionally present a cluster aggregation strategy after multiple imputation. It turns out that even a simplified and time-saving variant of the presented method can compete with multiple imputation and subsequent pooling.
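The k-POD idea mentioned in the abstract (fill each missing value from the center of the cluster the observation is currently assigned to, then re-cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: it covers numerical variables only, and the function name `kpod_sketch`, the column-mean initial fill, and the random choice of initial centers are assumptions made for this example. In a k-prototypes setting, categorical variables would use cluster modes rather than means as prototype values.

```python
import numpy as np

def kpod_sketch(X, k, n_iter=50, seed=0):
    """Illustrative k-POD-style loop for numerical data (hypothetical helper).

    Alternates k-means steps with refilling missing cells from the
    center of the cluster each row is currently assigned to.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Assumption: start from column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Assumption: initial centers are k distinct rows chosen at random.
    centers = filled[rng.choice(len(filled), size=k, replace=False)]
    labels = np.full(len(filled), -1)

    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        dists = ((filled[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Update step: recompute each non-empty cluster's center.
        for j in range(k):
            members = filled[new_labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        # Imputation step: missing cells take the assigned center's value;
        # observed cells are always restored from the original data.
        filled = np.where(missing, centers[new_labels], X)

        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, centers, filled


# Small usage example with two missing cells.
X = np.array([
    [0.0, 0.1], [0.2, np.nan], [0.1, 0.0],
    [5.0, 5.1], [np.nan, 5.0], [5.2, 4.9],
])
labels, centers, filled = kpod_sketch(X, k=2)
```

Observed entries are never overwritten; only the cells that were missing in the input change between iterations, which is what distinguishes this scheme from a one-off imputation before clustering.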

Citation

Pages: 2 / 24

Page count: 23

40 references in total

[1] Agresti A., 2007, CATEGORICAL DATA ANA, V2nd, DOI [10.1002/0470114754, 10.1002/9780470114759.ch5]

[2] Survey of State-of-the-Art Mixed Data Clustering Algorithms [J]. Ahmad, Amir; Khan, Shehroz S. IEEE ACCESS, 2019, 7: 31883-31902

[3] [Anonymous], 2020, R LANG ENV STAT COMP

[4] Aschenbruck R, 2020, V6, P02, DOI 10.5445/KSP/1000098011/02

[5] Audigier V., 2020, CLUSTERING MISSING D, DOI 10.48550/arXiv.2011.13694

[6] A Framework for Multiple Imputation in Cluster Analysis [J]. Basagana, Xavier; Barrera-Gomez, Jose; Benet, Marta; Anto, Josep M.; Garcia-Aymerich, Judith. AMERICAN JOURNAL OF EPIDEMIOLOGY, 2013, 177 (07): 718-725

[7] MULTIPLE IMPUTATION FOR NONRESPONSE IN SURVEYS - RUBIN,DB [J]. CAMPION, WM. JOURNAL OF MARKETING RESEARCH, 1989, 26 (04): 485-486

[8] Carpenter JA, 2013, STUD WORLD CHR SER, P1

[9] OpenML: An R package to connect to the machine learning platform OpenML [J]. Casalicchio, Giuseppe; Bossek, Jakob; Lang, Michel; Kirchhoff, Dominik; Kerschke, Pascal; Hofner, Benjamin; Seibold, Heidi; Vanschoren, Joaquin; Bischl, Bernd. COMPUTATIONAL STATISTICS, 2019, 34 (03): 977-991

[10] k-POD: A Method for k-Means Clustering of Missing Data [J]. Chi, Jocelyn T.; Chi, Eric C.; Baraniuk, Richard G. AMERICAN STATISTICIAN, 2016, 70 (01): 91-99
