Inference from clustering with application to gene-expression microarrays

Cited by: 113
Authors
Dougherty, ER
Barrera, J
Brun, M
Kim, S
Cesar, RM
Chen, YD
Bittner, M
Trent, JM
Affiliations
[1] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77843 USA
[2] Univ Sao Paulo, Dept Ciencia Comp, Sao Paulo, Brazil
[3] NIH, Natl Human Genome Res Inst, Bethesda, MD 20892 USA
Keywords
clustering; gene expression; microarray
DOI
10.1089/10665270252833217
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to the process to which they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based clustering, and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. A large amount of generated output is available over the web.
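The evaluation loop described in the abstract (mean plus independent noise, sample generation, clustering, error counting) can be illustrated with a minimal Python sketch. This is not the authors' toolbox. The function names, the Gaussian noise, the modeling of replication as averaging of repeated draws, the farthest-first initialization of K-means, and the matching of cluster labels to classes by trying every permutation are all assumptions made for illustration, and the template means are hypothetical.

```python
import itertools
import numpy as np

def generate_samples(means, sigma, n_per_class, n_replicates, rng):
    """Draw points as class mean plus independent noise (Gaussian is assumed).
    Averaging n_replicates draws is an assumed model of replication."""
    means = np.asarray(means, dtype=float)
    k, d = means.shape
    labels = np.repeat(np.arange(k), n_per_class)
    noise = rng.normal(0.0, sigma, size=(n_replicates, labels.size, d))
    points = means[labels] + noise.mean(axis=0)
    return points, labels

def kmeans(points, k, n_iter=50):
    """Plain K-means with a deterministic farthest-first start (an assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen center.
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def clustering_error(true_labels, assign, k):
    """Number of misclustered points, minimized over cluster-to-class matchings.
    Exhaustive permutation search is only practical for small k."""
    best = true_labels.size
    for perm in itertools.permutations(range(k)):
        mapped = np.array(perm)[assign]
        best = min(best, int(np.sum(mapped != true_labels)))
    return best

# Hypothetical expression templates (one row per underlying process).
rng = np.random.default_rng(0)
template_means = [[0, 0, 0, 0], [5, 5, 5, 5], [0, 5, 0, 5]]
for replicates in (1, 2, 4):
    pts, labels = generate_samples(template_means, sigma=3.0, n_per_class=50,
                                   n_replicates=replicates, rng=rng)
    errors = clustering_error(labels, kmeans(pts, 3), 3)
    print(f"replicates={replicates}: {errors} of {labels.size} points misclustered")
```

Repeating the loop over several random seeds and averaging the error counts would mirror the abstract's statement that results are averaged over various seeds.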
Pages: 105-126
Page count: 22