Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach

Cited by: 210
Authors
Pihur, Vasyl [1]
Datta, Susmita [1]
Datta, Somnath [1]
Affiliations
[1] Univ Louisville, Dept Bioinformat & Biostat, Louisville, KY 40202 USA
DOI
10.1093/bioinformatics/btm158
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the many clustering algorithms available in the machine-learning literature, a user may want to select the one that performs best for their data set or application. Various validation measures have been proposed over the years to judge the quality of the clusters produced by a given algorithm, including their biological relevance. Unfortunately, an algorithm can perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed.
Results: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, where we rank a total of 11 clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k.
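The weighted aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a plain Spearman footrule distance with one importance weight per validation measure, a position-by-item probability matrix as the sampling distribution, and illustrative parameter values; the function names (`footrule`, `objective`, `sample_permutation`, `ce_rank_aggregate`) and the algorithm labels in the usage example are hypothetical.

```python
import numpy as np


def footrule(candidate, ranked_list):
    """Spearman footrule: sum over items of |position in candidate - position in list|."""
    pos = {item: i for i, item in enumerate(candidate)}
    return sum(abs(pos[item] - i) for i, item in enumerate(ranked_list))


def objective(candidate, lists, weights):
    """Weighted sum of distances from the candidate ordering to every input list."""
    return sum(w * footrule(candidate, lst) for w, lst in zip(weights, lists))


def sample_permutation(P, rng):
    """Fill positions one by one, drawing among the items not yet placed."""
    remaining = list(range(P.shape[0]))
    perm = []
    for position in range(P.shape[0]):
        p = P[position, remaining] + 1e-12  # small floor keeps every item reachable
        idx = rng.choice(len(remaining), p=p / p.sum())
        perm.append(remaining.pop(idx))
    return perm


def ce_rank_aggregate(lists, weights, n_samples=200, elite_frac=0.1,
                      smooth=0.7, n_iter=50, seed=0):
    """Cross-entropy Monte Carlo search for an ordering with small weighted distance.

    Assumption: every input list is a permutation of 0..n-1 (algorithm indices).
    """
    rng = np.random.default_rng(seed)
    n = len(lists[0])
    P = np.full((n, n), 1.0 / n)  # P[position, item], uniform at the start
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        samples = [sample_permutation(P, rng) for _ in range(n_samples)]
        scores = np.array([objective(s, lists, weights) for s in samples])
        elite = np.argsort(scores)[:n_elite]  # lowest-distance samples
        if scores[elite[0]] < best_score:
            best_score, best = float(scores[elite[0]]), samples[elite[0]]
        # Re-estimate P from the elite samples, then smooth towards the old P.
        freq = np.zeros((n, n))
        for i in elite:
            for position, item in enumerate(samples[i]):
                freq[position, item] += 1
        P = smooth * freq / n_elite + (1 - smooth) * P
    return best, best_score


if __name__ == "__main__":
    names = ["hierarchical", "kmeans", "pam", "som"]  # hypothetical labels
    lists = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]  # one ranking per measure
    weights = [1.0, 0.5, 0.5]  # hypothetical importance weights
    best, score = ce_rank_aggregate(lists, weights)
    print([names[i] for i in best], score)
```

Each iteration samples candidate orderings, keeps the fraction with the smallest weighted distance, and shifts the sampling distribution towards them, so later samples concentrate on orderings that agree with the heavily weighted measures. Swapping `footrule` for another distance (for example a Kendall tau variant) would change only the objective, not the search loop.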
Pages: 1607-1615
Page count: 9