Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis

Cited by: 8

Authors:

Zhou, Yan ^{[1]}

Wang, Pei ^{[2]}

Wang, Xianlong ^{[3]}

Zhu, Ji ^{[4]}

Song, Peter X.-K. ^{[4]}

Affiliations:

[1] Merck & Co Inc, N Wales, PA USA

[2] Icahn Sch Med Mt Sinai, New York, NY 10029 USA

[3] Fred Hutchinson Canc Res Ctr, 1124 Columbia St, Seattle, WA 98104 USA

[4] Univ Michigan, Ann Arbor, MI 48109 USA

Source:

GENETIC EPIDEMIOLOGY | 2017 / Vol. 41 / Issue 01

Funding:

National Institutes of Health (USA); National Science Foundation (USA);

Keywords:

EM-blockwise coordinate descent; high-dimensional data; latent factors; regularization; COPY NUMBER ALTERATIONS; GENE-EXPRESSION; SELECTION; REVEALS; TARGET;

DOI:

10.1002/gepi.22018

Chinese Library Classification number:

Q3 [Genetics];

Discipline classification codes:

071007 ; 090102 ;

Abstract:

The multivariate regression model is a useful tool for exploring complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of response variables are assumed to follow a factor analysis model with latent factors. The proposed method allows us not only to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve the sensitivity of signal detection and the accuracy of sparse association map estimation. We illustrate smFARM with two integrative genomics analysis examples, a breast cancer dataset and an ovarian cancer dataset, assessing the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. We identify two trans-hub regions: one in cytoband 17q12, whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer.


Pages: 70-80

Page count: 11
