Including Probe-Level Measurement Error in Robust Mixture Clustering of Replicated Microarray Gene Expression

Cited by: 3
Authors
Liu, Xuejun [1]
Rattray, Magnus [2, 3]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Peoples R China
[2] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England
[3] Univ Sheffield, Sheffield Inst Translat Neurosci, Sheffield S10 2TN, S Yorkshire, England
Funding
National Science Foundation (USA);
Keywords
microarray data; gene expression clustering; mixture models; PROBABILISTIC MODEL; CYCLE; IDENTIFICATION; TRANSCRIPTION; YEAST;
DOI
10.2202/1544-6115.1600
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010; 081704;
Abstract
Probabilistic mixture models provide a popular approach to cluster noisy gene expression data for exploring gene function. Since gene expression data obtained from microarray experiments are often associated with significant sources of technical and biological noise, replicated experiments are typically used to deal with data variability, and internal replication (e.g. from multiple probes per gene in an experiment) provides valuable information about technical sources of noise. However, current implementations of mixture models either do not consider the correlation between the replicated measurements for the same experimental condition, or ignore the probe-level measurement error, and thus overlook the rich information about technical noise. Moreover, most current methods use non-robust Gaussian components to describe the data, and these methods are therefore sensitive to non-Gaussian clusters and outliers. In many cases, this will lead to over-estimation of the number of model components as multiple Gaussian components are used to fit a non-Gaussian cluster. We propose a robust Student's t-mixture model, which explicitly handles replicated gene expression data, includes the consideration of probe-level measurement error when available and automatically selects the appropriate number of model components using a minimum message length criterion. We apply the model to gene expression data using probe-level measurements from an Affymetrix probe-level model, multi-mgMOS, which provides uncertainty estimates. The proposed Student's t-mixture model shows robust performance on synthetic data sets with realistic noise characteristics in comparison to a standard Gaussian mixture model and two other previously published methods. 
We also compare performance with these methods on two yeast time-course data sets and show that the new method obtains more biologically meaningful clusters in terms of enrichment statistics for GO categories and interactions between transcription factors and genes. Automatically selecting the number of components is more computationally efficient than using a model selection approach and allows the methods to be applied to larger data sets.
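The robustness argument in the abstract (Gaussian components are sensitive to outliers, Student's t components are not) can be illustrated with a minimal sketch. This is not the authors' model: it fits a single univariate Student's t component with a fixed number of degrees of freedom by EM, and it ignores replicates, probe-level measurement error, and component selection. The function name `t_location_scale`, the value `nu=3.0`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def t_location_scale(x, nu=3.0, n_iter=50):
    """EM estimate of location and scale for one univariate Student's t
    component with fixed degrees of freedom nu (illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = x.var()
    for _ in range(n_iter):
        # E-step: expected precision weights; points far from mu get small weights
        u = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)
        # M-step: weighted location and scale updates
        mu = np.sum(u * x) / np.sum(u)
        sigma2 = np.sum(u * (x - mu) ** 2) / x.size
    return mu, sigma2, u

# Hypothetical data: 200 well-behaved points plus three gross outliers
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 200)
x = np.concatenate([clean, [25.0, 30.0, 40.0]])

mu_t, sigma2_t, u = t_location_scale(x)
print("mean of clean points:  ", clean.mean())
print("Gaussian (sample) mean:", x.mean())
print("Student's t location:  ", mu_t)
```

The sample mean, which is the maximum-likelihood location under a Gaussian, is pulled toward the outliers, while the t-based estimate stays close to the bulk of the data because the E-step weights `u` shrink for distant points. In a mixture setting the same weights would multiply the component responsibilities, which is one way to see why a t-mixture may need fewer components to cover a heavy-tailed cluster.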
Pages: 23