Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis

Cited by: 8
Authors
Zhou, Yan [1]
Wang, Pei [2]
Wang, Xianlong [3]
Zhu, Ji [4]
Song, Peter X.-K. [4]
Affiliations
[1] Merck & Co Inc, N Wales, PA USA
[2] Icahn Sch Med Mt Sinai, New York, NY 10029 USA
[3] Fred Hutchinson Canc Res Ctr, 1124 Columbia St, Seattle, WA 98104 USA
[4] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
EM-blockwise coordinate descent; high-dimensional data; latent factors; regularization; COPY NUMBER ALTERATIONS; GENE-EXPRESSION; SELECTION; REVEALS; TARGET;
DOI
10.1002/gepi.22018
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
The multivariate regression model is a useful tool for exploring complex associations between two kinds of molecular markers, which helps in understanding the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations among response variables are assumed to follow a factor analysis model with latent factors. The proposed method not only addresses the challenge that the number of association parameters is larger than the sample size, but also adjusts for unobserved genetic and/or nongenetic factors that may conceal the underlying response-predictor associations. smFARM is implemented using the EM algorithm and the blockwise coordinate descent algorithm. The methodology is evaluated and compared with existing methods through extensive simulation studies. Our results show that accounting for latent factors through smFARM can improve the sensitivity of signal detection and the accuracy of sparse association map estimation. We illustrate smFARM with two integrative genomics analysis examples, a breast cancer dataset and an ovarian cancer dataset, assessing the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to each disease. We identify two trans-hub regions: one in cytoband 17q12, whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer.
Pages: 70-80
Number of pages: 11