Distance Functions, Clustering Algorithms and Microarray Data Analysis

Cited by: 23

Authors:

Giancarlo, Raffaele [1]

Lo Bosco, Giosue [1]

Pinello, Luca [1]

Affiliations:

[1] Univ Palermo, Dipartimento Matemat & Informat, I-90133 Palermo, Italy

Source:

LEARNING AND INTELLIGENT OPTIMIZATION | 2010 / Volume 6073

Keywords:

GENE-EXPRESSION DATA; VALIDATION;

DOI:

10.1007/978-3-642-13800-3_10

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Distance functions are a fundamental ingredient of classification and clustering procedures, and this also holds true in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained the status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function "works best" has been investigated, but no final conclusion has been reached. The aim of this paper is to shed further light on that issue. To that end, we present an experimental study, involving several distances, that assesses (a) their intrinsic separation ability and (b) their predictive power when used in conjunction with clustering algorithms. The experiments have been carried out on six benchmark microarray datasets, for each of which the "gold solution" is known. We have used both Hierarchical and K-means clustering algorithms, with external validation criteria as evaluation tools. From the methodological point of view, the main result of this study is a ranking of those measures in terms of their intrinsic and clustering abilities, which also highlights the correlations between the two. Pragmatically, the outcomes of the experiments indicate that the Minkowski, cosine and Pearson correlation distances seem to be the best choices for microarray data analysis.
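As a rough illustration of the three distances the abstract singles out, the sketch below computes Minkowski, cosine and Pearson correlation distances between two expression profiles in plain Python. The abstract does not give the paper's exact definitions, so the "1 minus similarity" form used here for the cosine and Pearson distances is an assumption (a common convention), and the function names and the sample profiles `gene_a` and `gene_b` are hypothetical.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """Assumed convention: 1 - cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def pearson_distance(x, y):
    """Assumed convention: 1 - Pearson correlation coefficient.

    Pearson correlation equals the cosine similarity of the
    mean-centered vectors, so the cosine helper is reused.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)


# Hypothetical expression profiles: gene_b is gene_a scaled by 2.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

euclid = minkowski(gene_a, gene_b, p=2)
cos_d = cosine_distance(gene_a, gene_b)
pear_d = pearson_distance(gene_a, gene_b)
```

On these two profiles the Euclidean distance is sqrt(30), while the cosine and Pearson distances are zero up to rounding, since both ignore a uniform rescaling of the profile. This difference in what each measure treats as "similar" is the kind of behavior the study compares.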


Pages: 125-138

Page count: 14

References (28 in total):

[1] Alizadeh, AA; Eisen, MB; Davis, RE; Ma, C; Lossos, IS; Rosenwald, A; Boldrick, JG; Sabet, H; Tran, T; Yu, X; Powell, JI; Yang, LM; Marti, GE; Moore, T; Hudson, J; Lu, LS; Lewis, DB; Tibshirani, R; Sherlock, G; Chan, WC; Greiner, TC; Weisenburger, DD; Armitage, JO; Warnke, R; Levy, R; Wilson, W; Grever, MR; Byrd, JC; Botstein, D; Brown, PO; Staudt, LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling [J]. NATURE, 2000, 403 (6769): 503-511
[2] [Anonymous], THESIS U WASHINGTON
[3] [Anonymous], CURRENT TOPICS COMPU
[4] CHEN JY, 2009, BIOL DATA MINING STA, P295
[5] Costa, IG; de Carvalho, FDT; de Souto, MCP. Comparative analysis of clustering methods for gene expression time course data [J]. GENETICS AND MOLECULAR BIOLOGY, 2004, 27 (04): 623-631
[6] Cover T. M., 1999, Elements of information theory
[7] D'haeseleer, P. How does gene expression clustering work? [J]. NATURE BIOTECHNOLOGY, 2005, 23 (12): 1499-1501
[8] Daub, CO; Steuer, R; Selbig, J; Kloska, S. Estimating mutual information using B-spline functions - an improved similarity measure for analysing gene expression data [J]. BMC BIOINFORMATICS, 2004, 5 (1)
[9] Deza E., 2006, Dictionary of Distances
[10] Di Gesú, V; Giancarlo, R; Lo Bosco, G; Raimondi, A; Scaturro, D. GenClust: A genetic algorithm for clustering gene expression data (art. no. 289) [J]. BMC BIOINFORMATICS, 2005, 6 (1)
