Exploring the diversity in cluster ensemble generation: Random sampling and random projection

Cited by: 42
Authors
Yang, Fan [1]
Li, Xuan [1]
Li, Qianmu [2]
Li, Tao [3]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Engn, Xiamen 361005, Peoples R China
[2] Nanjing Univ Sci & Technol, Sch Comp Sci, Nanjing 210094, Jiangsu, Peoples R China
[3] Florida Int Univ, Sch Comp Sci, Miami, FL 33199 USA
Funding
National Natural Science Foundation of China; National Science Foundation (USA)
Keywords
Random sampling; Random projection; Ensemble generation; Ensemble clustering; CONSENSUS; FRAMEWORK;
DOI
10.1016/j.eswa.2014.01.028
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Cluster ensemble first generates a large library of different clustering solutions and then combines them into a more accurate consensus clustering. It is commonly accepted that for a cluster ensemble to work well, the member partitions should differ from each other while the quality of each partition remains at an acceptable level. Many different strategies have been used to generate different base partitions for cluster ensemble. Similar to ensemble classification, many studies have focused on generating different partitions of the original dataset, i.e., clustering on different subsets (e.g., obtained using random sampling) or clustering in different feature spaces (e.g., obtained using random projection). However, little attention has been paid to the diversity and quality of the partitions generated using these two approaches. In this paper, we propose a novel cluster generation method based on random sampling, which uses the nearest neighbor method to fill in the category information of the missing samples (abbreviated as RS-NN). We evaluate its performance in comparison with k-means ensemble, a typical random projection method (Random Feature Subset, abbreviated as FS), and another random sampling method (Random Sampling based on Nearest Centroid, abbreviated as RS-NC). Experimental results indicate that the FS method always generates more diverse partitions while the RS-NC method generates high-quality partitions. Our proposed method, RS-NN, generates base partitions with a good balance between quality and diversity and achieves significant improvement over alternative methods. Furthermore, to introduce more diversity, we propose a dual random sampling method which combines the RS-NN and FS methods. The proposed method can achieve higher diversity with good quality on most datasets. © 2014 Elsevier Ltd. All rights reserved.
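The RS-NN idea described in the abstract (cluster a random subset, then give each left-out sample the label of its nearest sampled neighbor) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the base clusterer settings, the sampling ratio, or the distance measure, so k-means with Euclidean distance, sampling without replacement, and the names `kmeans`, `rs_nn_partition`, and `generate_ensemble` are all assumptions made here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed base clusterer); one label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: recompute centroids; an empty cluster keeps its old one.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def rs_nn_partition(data, k, sample_ratio, rng):
    """One base partition: cluster a random subset, fill the rest by 1-NN."""
    n = len(data)
    m = min(n, max(k, int(sample_ratio * n)))  # subset size (assumed rule)
    sampled = rng.sample(range(n), m)          # sampling without replacement
    sample_labels = kmeans([data[i] for i in sampled], k, rng)

    labels = [None] * n
    for i, lab in zip(sampled, sample_labels):
        labels[i] = lab
    # Each missing sample takes the label of its nearest sampled neighbor.
    for i in range(n):
        if labels[i] is None:
            nearest = min(range(m),
                          key=lambda j: sq_dist(data[i], data[sampled[j]]))
            labels[i] = sample_labels[nearest]
    return labels


def generate_ensemble(data, k, n_members=5, sample_ratio=0.7, seed=0):
    """Library of base partitions, each built from a fresh random subset."""
    rng = random.Random(seed)
    return [rs_nn_partition(data, k, sample_ratio, rng)
            for _ in range(n_members)]


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
    for partition in generate_ensemble(toy, k=2, n_members=3):
        print(partition)
```

Cluster label numbering is arbitrary across members, so a consensus step would need to compare partitions in a label-invariant way. A variant in the spirit of the dual random sampling method could also draw a random feature subset before clustering, but the abstract gives no details on how the two are combined.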
Pages: 4844-4866
Page count: 23