Principal component analysis based methods in bioinformatics studies

Cited by: 167
Authors
Ma, Shuangge [1]
Dai, Ying [2]
Affiliations
[1] Yale Univ, Sch Publ Hlth, New Haven, CT 06520 USA
[2] Xiamen Univ, Dept Planning & Stat, Sch Econ, Xiamen, Peoples R China
Funding
National Science Foundation (USA)
Keywords
principal component analysis; dimension reduction; bioinformatics methodologies; gene expression; GENE; SURVIVAL;
DOI
10.1093/bib/bbq090
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
In the analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use genomic studies with gene expression measurements as a representative example, but note that the analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain the variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent 'non-standard' applications of PCA, including accommodating interactions among genes, pathways and network modules, and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including supervised PCA, sparse PCA and functional PCA. Supervised PCA and sparse PCA have been shown to have better empirical performance than standard PCA. Functional PCA can analyze time-course gene expression data. Last, we raise awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and, more importantly, its most recent developments, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
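As a rough illustration of the standard PCA described in the abstract, the sketch below computes PCs as linear combinations of gene expressions using a singular value decomposition. It is not taken from the article: the function name `pca`, the synthetic 20-sample by 500-gene matrix, and the choice of three components are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """PCA of an (n_samples, n_genes) matrix via SVD of the centered data.

    Hypothetical helper for illustration; not the article's implementation.
    """
    Xc = X - X.mean(axis=0)                  # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]             # rows are orthonormal loading vectors
    scores = Xc @ loadings.T                 # PCs: linear combinations of genes
    explained = s**2 / np.sum(s**2)          # proportion of variance per PC
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for expression data: 20 samples, 500 genes (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
scores, loadings, ratio = pca(X, n_components=3)
```

Here `scores` has one row per sample and one column per PC, so 500 gene measurements are reduced to 3 derived variables, and the columns of `scores` are mutually orthogonal by construction.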
Pages: 714-722
Number of pages: 9