Sparse latent factor regression models for genome-wide and epigenome-wide association studies

Cited by: 4
Authors
Jumentier, Basile [1, 2]
Caye, Kevin [1 ]
Heude, Barbara [3 ]
Lepeule, Johanna [2 ]
Francois, Olivier [1, 4]
Affiliations
[1] Univ Grenoble Alpes, Ctr Natl Rech Sci, Grenoble INP, TIMC,IMAG CNRS UMR 5525, F-38000 Grenoble, France
[2] Univ Grenoble Alpes, Ctr Natl Rech Sci, Inst Natl Sante & Rech Medicale, Inst Adv Biosci, INSERM 1209, CNRS UMR 5309, F-38000 Grenoble, France
[3] Univ Paris, Inst Natl Sante & Rech Medicale, Ctr Res Epidemiol & Stat, INSERM UMR 1153, F-75004 Paris, France
[4] Inria Grenoble, Equipe Statify, Lab Jean Kuntzmann, Rhone Alpes Inovallee 655 Ave Europe CS 90051, F-38334 Montbonnot St Martin, France
Keywords
confounding factors; epigenome-wide association; genome-wide association; sparse model; statistical methods; GENE-EXPRESSION; MATRIX; LOCI;
DOI
10.1515/sagmb-2021-0035
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than, though comparable to, that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
Pages: 19