Stratified feature sampling method for ensemble clustering of high dimensional data

Cited by: 60
Authors
Jing, Liping [1]
Tian, Kuang [1]
Huang, Joshua Z. [2]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Stratified sampling; Ensemble clustering; High dimensional data; Consensus function; CLASS DISCOVERY; CLASSIFICATION; PREDICTION; CONSENSUS; SELECTION; CANCER;
DOI
10.1016/j.patcog.2015.05.006
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
High dimensional data with thousands of features present a big challenge to current clustering algorithms. Sparsity, noise and correlation of features are common characteristics of such data. Another common phenomenon is that clusters in such high dimensional data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving the robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, we first cluster the features of the high dimensional data into a few feature groups called feature strata. Using stratified sampling, we then randomly sample some features from each feature stratum and merge the sampled features from the different strata to generate a component data set. In this way, the component data sets better represent the clustering structure of the original data set. In synthetic data analysis, component clustering with stratified sampling achieved higher average clustering accuracy than random sampling and random projection methods, without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from microarray, text and image domains to evaluate ensemble clustering methods using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. Ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods that generate component clusters from the entire space of the original data. (C) 2015 Elsevier Ltd. All rights reserved.
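The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which algorithm clusters the features or how many features are drawn per stratum, so the plain k-means on standardized columns, the `fraction` parameter, and all function names below are assumptions made for illustration only.

```python
import numpy as np


def cluster_features(X, n_strata, n_iter=20, seed=0):
    """Group the columns of X into feature strata with a plain k-means.

    Assumption: k-means is only a stand-in; the abstract does not name
    the feature clustering algorithm.
    """
    rng = np.random.default_rng(seed)
    F = X.T.astype(float)  # one row per feature
    F = (F - F.mean(axis=1, keepdims=True)) / (F.std(axis=1, keepdims=True) + 1e-12)
    centers = F[rng.choice(len(F), size=n_strata, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every center.
        dists = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_strata):
            if np.any(labels == k):
                centers[k] = F[labels == k].mean(axis=0)
    return labels


def stratified_feature_sample(labels, fraction, rng):
    """Sample features inside every non-empty stratum, then merge them."""
    chosen = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        # Hypothetical rule: a fixed fraction per stratum, at least one feature.
        n_pick = max(1, int(round(fraction * len(members))))
        chosen.append(rng.choice(members, size=n_pick, replace=False))
    return np.sort(np.concatenate(chosen))


def component_data_sets(X, n_components, n_strata=3, fraction=0.3, seed=0):
    """Build several subspace component data sets from one feature stratification."""
    rng = np.random.default_rng(seed)
    labels = cluster_features(X, n_strata, seed=seed)
    subsets = [stratified_feature_sample(labels, fraction, rng)
               for _ in range(n_components)]
    return labels, [(idx, X[:, idx]) for idx in subsets]


# Toy usage on random data: 50 objects, 40 features, 5 component data sets.
X_demo = np.random.default_rng(1).normal(size=(50, 40))
labels_demo, components_demo = component_data_sets(X_demo, n_components=5)
```

Each component data set would then be clustered separately and the resulting partitions combined by a consensus function; that part is outside this sketch. Because every non-empty stratum contributes at least one feature, each component sees every feature group, which is the property the abstract contrasts with purely random feature sampling.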
Pages: 3688-3702
Page count: 15