EnsCat: clustering of categorical data via ensembling

Cited by: 0

Authors:

Clarke, Bertrand S. [1]

Amiri, Saeid [2]

Clarke, Jennifer L. [1,3]

Affiliations:

[1] Univ Nebraska Lincoln, Dept Stat, Lincoln, NE 68588 USA

[2] Univ Wisconsin Madison, Dept Nat & Appl Sci, Iowa City, IA USA

[3] Univ Nebraska Lincoln, Dept Food Sci & Technol, Lincoln, NE 68588 USA

Source:

BMC Bioinformatics | 2016 / Vol. 17

Funding:

US National Science Foundation;

Keywords:

Categorical data; Clustering; Ensembling methods; High dimensional data; ALGORITHM;

DOI:

10.1186/s12859-016-1245-9

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: the distance metric is often not well defined on categorical data; running time for computations on high dimensional data can be considerable; and the curse of dimensionality often impedes interpretation of the results. To date, the literature and software addressing clustering of categorical data have not led to a standard approach.

Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that, in many settings, an ensemble generally outperforms the components that comprise it. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces, likewise of variable size, to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance.

Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat.
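
The procedure summarized under Results can be illustrated with a small sketch. This is not the EnsCat R code; it is a minimal pure-Python approximation written for illustration. The function names (`hamming`, `agglomerate`, `ensemble_dissimilarity`), the choice of average linkage, the cycling through a fixed tuple of cluster counts, and the way random subspaces are drawn are all assumptions, not details taken from the abstract or the package.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dissimilarity_matrix(rows):
    """Symmetric matrix of pairwise Hamming distances between rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = hamming(rows[i], rows[j])
    return d


def agglomerate(d, k):
    """Average-linkage agglomerative clustering (assumed linkage), cut at k clusters."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = len(clusters[a]) * len(clusters[b])
            link = sum(d[i][j] for i in clusters[a] for j in clusters[b]) / pairs
            if best is None or link < best[0]:
                best = (link, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def ensemble_dissimilarity(rows, n_members=12, k_values=(2, 3, 4),
                           subspace=False, seed=0):
    """Build the membership matrix and return its Hamming dissimilarity matrix."""
    rng = random.Random(seed)
    p = len(rows[0])
    # membership[i] collects the label of observation i in each base clustering
    membership = [[] for _ in rows]
    for m in range(n_members):
        k = k_values[m % len(k_values)]  # vary the number of clusters
        if subspace:
            # hypothetical subspace rule: a random subset of columns of random size
            cols = rng.sample(range(p), rng.randint(1, p))
            view = [[r[c] for c in cols] for r in rows]
        else:
            view = rows
        for i, lab in enumerate(agglomerate(dissimilarity_matrix(view), k)):
            membership[i].append(lab)
    # two observations are close if they share a cluster in most base clusterings
    return dissimilarity_matrix(membership)


if __name__ == "__main__":
    seqs = ["AAAAAA", "AAAAAT", "AAAATA", "GGGGGG", "GGGGGC", "GGGGCG"]
    ens = ensemble_dissimilarity(seqs, n_members=4, k_values=(2, 3))
    print(agglomerate(ens, 2))
```

Cluster labels are arbitrary within each base clustering, but Hamming distance only compares labels within the same column of the membership matrix, so no label alignment across clusterings is needed. Running a final hierarchical clustering on the ensembled matrix, as in the demo, is one plausible way to obtain a consensus partition.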


Pages: 13

