Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies

Cited by: 2
Authors
Wang, Haohan [1]
Aragam, Bryon [2]
Xing, Eric P. [2]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Sch Comp Sci, Machine Learning Dept, Pittsburgh, PA 15213 USA
Funding
Andrew Mellon Foundation (USA); National Institutes of Health (USA)
Keywords
Variable selection; Genome-wide association study; Mixed model; Heterogeneity; Confounding correction; ALZHEIMERS-DISEASE; POPULATION-STRUCTURE; GENETIC ASSOCIATION; LASSO; REGRESSION; VARIANTS; POWER;
DOI
10.1016/j.ymeth.2018.04.021
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naively applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our method.
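The idea summarized in the abstract (correct for hidden population structure with a low-rank covariance, then do sparse selection) can be illustrated with a small sketch. This is not the authors' implementation: the function names `whitening_matrix` and `lasso_cd`, the fixed `rank` and `delta` parameters, the kinship estimate `X @ X.T / p`, and the plain coordinate-descent Lasso are all assumptions made for illustration. In particular, the adaptive choice of covariance complexity described in the abstract is not reproduced here; `rank` is simply supplied by the caller.

```python
import numpy as np

def whitening_matrix(X, rank, delta=1.0):
    """Whitening matrix for the assumed covariance delta * K_rank + I.

    K_rank keeps only the top `rank` eigenvalues of the sample
    similarity matrix K = X X^T / p (an illustrative choice).
    """
    n, p = X.shape
    K = X @ X.T / p
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kept = np.zeros(n)
    kept[:rank] = np.clip(eigvals[:rank], 0.0, None)  # truncate the spectrum
    scale = 1.0 / np.sqrt(delta * kept + 1.0)
    return scale[:, None] * eigvecs.T             # diag(scale) @ U^T

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]            # remove feature j's contribution
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]            # add the updated contribution back
    return beta

# Toy usage on synthetic data (not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
y = 2.0 * X[:, 0] + rng.standard_normal(20)
W = whitening_matrix(X, rank=3, delta=1.0)
beta = lasso_cd(W @ X, W @ y, lam=5.0)
selected = np.flatnonzero(beta)                   # indices of selected variables
```

Rotating both `X` and `y` by the same matrix turns the correlated-noise problem into an ordinary regression, so any standard sparse solver could stand in for `lasso_cd` here.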
Pages: 2-9
Page count: 8