A robustness metric for biological data clustering algorithms

Cited by: 16
Authors
Lu, Yuping [1]
Phillips, Charles A. [1]
Langston, Michael A. [1]
Affiliations
[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Funding
National Institutes of Health (USA)
Keywords
Robustness; Clustering algorithms; Paraclique; GENE-EXPRESSION; VALIDATION; LINKAGE
DOI
10.1186/s12859-019-3089-6
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change?
Results: This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily-interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them.
Conclusions: Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
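Illustrative note: the abstract does not give the formula behind the robustness metric, so the sketch below is not the authors' definition. It shows one generic way to quantify how much clusterings of the same items change across settings, assuming agreement is measured as the Jaccard index over co-clustered item pairs and then averaged over all pairs of settings. The function names (`co_clustered_pairs`, `pairwise_jaccard`, `stability_across_settings`) are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of item-index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_jaccard(labels_a, labels_b):
    """Jaccard index of the co-clustered pairs of two clusterings.

    Both inputs are label lists over the same items, in the same order.
    Label names themselves do not matter, only which items are grouped.
    """
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:
        # Neither clustering groups any two items together: treat as identical.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


def stability_across_settings(clusterings):
    """Mean pairwise Jaccard agreement over clusterings from different settings.

    An assumed stand-in for a stability score, not the paper's metric.
    """
    scores = [pairwise_jaccard(a, b) for a, b in combinations(clusterings, 2)]
    return sum(scores) / len(scores) if scores else 1.0
```

In use, each label list would come from running the same algorithm on the same dataset under a different setting; a score near 1 means the output barely changes as the setting varies.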
Pages: 8