A data-driven clustering method for time course gene expression data

Cited by: 115
Authors
Ma, P [1]
Castillo-Davis, CI [1]
Zhong, WX [1]
Liu, JS [1]
Affiliations
[1] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Funding
US National Science Foundation; National Natural Science Foundation of China
DOI
10.1093/nar/gkl013
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Gene expression over time is, biologically, a continuous process and can thus be represented by a continuous function, i.e. a curve. Individual genes often share similar expression patterns (functional forms). However, the shape of each function, the number of such functions, and the genes that share similar functional forms are typically unknown. Here we introduce an approach that allows direct discovery of related patterns of gene expression and their underlying functions (curves) from data without a priori specification of either cluster number or functional form. Smoothing spline clustering (SSC) models natural properties of gene expression over time, taking into account natural differences in gene expression within a cluster of similarly expressed genes, the effects of experimental measurement error, and missing data. Furthermore, SSC provides a visual summary of each cluster's gene expression function and goodness-of-fit by way of a 'mean curve' construct and its associated confidence bands. We apply this method to gene expression data over the life cycle of Drosophila melanogaster and Caenorhabditis elegans and discover 17 and 16 unique patterns of gene expression, respectively. New and previously described expression patterns in both species are discovered, the majority of which are biologically meaningful and exhibit statistically significant gene function enrichment. Software and source code implementing the algorithm, SSClust, are freely available (http://genemerge.bioteam.net/SSClust.html).
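The 'mean curve' idea in the abstract can be illustrated with a minimal sketch. This is not SSClust and not the authors' model: it swaps the smoothing spline for a discrete penalized least-squares (Whittaker-style) smoother on an evenly spaced time grid, and it uses a crude pointwise standard-error band rather than the confidence bands the paper describes. The function names `smooth_curve` and `cluster_mean_curve`, the penalty weight `lam`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def smooth_curve(y, lam=1.0):
    """Discrete penalized smoother (assumed stand-in for a smoothing spline).

    Minimizes sum_i w_i * (y_i - f_i)^2 + lam * sum (second differences of f)^2
    on an evenly spaced grid. NaN entries get weight 0, i.e. they are missing.
    Needs at least 3 time points and at least 2 observed values.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    w = (~np.isnan(y)).astype(float)        # 1 where observed, 0 where missing
    D = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference matrix
    A = np.diag(w) + lam * D.T @ D          # normal equations of the penalized fit
    b = w * np.nan_to_num(y)                # missing values contribute nothing
    return np.linalg.solve(A, b)

def cluster_mean_curve(X, lam=1.0):
    """Smoothed mean profile of one cluster with a rough pointwise band.

    X has shape (genes, time points) and may contain NaN. Every time point
    is assumed to have at least 2 observed genes. The band is mean curve
    +/- 1.96 standard errors, which is only a crude approximation.
    """
    X = np.asarray(X, dtype=float)
    n_obs = np.sum(~np.isnan(X), axis=0)
    mean = np.nanmean(X, axis=0)
    se = np.nanstd(X, axis=0, ddof=1) / np.sqrt(n_obs)
    curve = smooth_curve(mean, lam)
    return curve, curve - 1.96 * se, curve + 1.96 * se

# Toy cluster (made-up numbers): three genes, six time points, two missing values.
X = np.array([
    [0.1, 0.9, 2.1, np.nan, 4.2, 5.0],
    [0.0, 1.2, 1.8, 3.1,    3.9, 5.3],
    [0.3, 1.0, np.nan, 2.8, 4.0, 4.8],
])
curve, lower, upper = cluster_mean_curve(X, lam=2.0)
print(np.round(curve, 2))
```

Larger values of `lam` pull the fitted curve toward a straight line, while `lam` near zero reproduces the raw per-time-point means; choosing it from the data is outside the scope of this sketch.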
Pages: 1261-1269
Number of pages: 9