Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach

Cited by: 210

Authors:

Pihur, Vasyl [1]

Datta, Susmita [1]

Datta, Somnath [1]

Affiliations:

[1] Univ Louisville, Dept Bioinformat & Biostat, Louisville, KY 40202 USA

Source:

BIOINFORMATICS | 2007 / Vol. 23 / Issue 13

Keywords:

DOI:

10.1093/bioinformatics/btm158

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the many clustering algorithms available in the machine-learning literature, a user might want to select the one that performs best for their data set or application. Various validation measures have been proposed over the years to judge the quality of the clusters produced by a given clustering algorithm, including their biological relevance. Unfortunately, a given clustering algorithm can perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed.

Results: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, where we rank a total of eleven clustering algorithms using a combined examination of ten different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k.
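The idea of combining ranked lists by optimizing a weighted distance with a cross-entropy Monte Carlo search can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the distance, so the Spearman footrule used here is an assumption, as are the sampling scheme (a position-by-item probability matrix), the parameter values, the function names, and the toy rankings.

```python
import random

def footrule(candidate, ranking):
    """Spearman footrule: total positional displacement between two orderings."""
    pos = {item: i for i, item in enumerate(ranking)}
    return sum(abs(i - pos[item]) for i, item in enumerate(candidate))

def objective(candidate, rankings, weights):
    """Weighted sum of distances from the candidate to every input ranking."""
    return sum(w * footrule(candidate, r) for r, w in zip(rankings, weights))

def sample_ordering(prob, items, rng):
    """Draw one ordering, position by position, from prob[position][item_index]."""
    remaining = list(range(len(items)))
    order = []
    for row in prob:
        # Small epsilon keeps the weights strictly positive (assumption for safety).
        w = [row[j] + 1e-12 for j in remaining]
        pick = rng.choices(remaining, weights=w, k=1)[0]
        remaining.remove(pick)
        order.append(items[pick])
    return order

def ce_aggregate(rankings, weights, n_samples=200, elite_frac=0.1,
                 alpha=0.7, n_iter=30, seed=0):
    """Cross-entropy search for the ordering that minimizes the weighted distance."""
    rng = random.Random(seed)
    items = sorted(rankings[0])
    n = len(items)
    index = {item: j for j, item in enumerate(items)}
    prob = [[1.0 / n] * n for _ in range(n)]  # start from a uniform matrix
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        samples = [sample_ordering(prob, items, rng) for _ in range(n_samples)]
        samples.sort(key=lambda c: objective(c, rankings, weights))
        elite = samples[:n_elite]
        score = objective(elite[0], rankings, weights)
        if score < best_score:
            best, best_score = elite[0], score
        # Smoothed update: move the matrix toward the elite samples' frequencies.
        for i in range(n):
            for j in range(n):
                freq = sum(1 for c in elite if index[c[i]] == j) / n_elite
                prob[i][j] = alpha * freq + (1 - alpha) * prob[i][j]
    return best, best_score

# Hypothetical orderings of four algorithms under three validation measures.
rankings = [
    ["kmeans", "hierarchical", "pam", "som"],
    ["hierarchical", "kmeans", "pam", "som"],
    ["kmeans", "pam", "hierarchical", "som"],
]
weights = [1.0, 1.0, 1.0]  # equal importance for each measure (assumption)
best, score = ce_aggregate(rankings, weights)
print(best, score)
```

Raising a measure's entry in `weights` pulls the aggregate ordering toward that measure's ranking. For lists this short an exhaustive search would also work; a stochastic search becomes useful when the number of ranked algorithms makes enumeration impractical.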


Pages: 1607-1615

Page count: 9

