GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies

Cited by: 33

Authors:

Sikorska, Karolina [1,2,3]

Lesaffre, Emmanuel [1,4]

Groenen, Patrick F. J. [5]

Eilers, Paul H. C. [1]

Affiliations:

[1] Erasmus MC, Dept Biostat, Rotterdam, Netherlands

[2] Erasmus MC, Dept Internal Med, Rotterdam, Netherlands

[3] Erasmus MC, Dept Epidemiol, Rotterdam, Netherlands

[4] Katholieke Univ Leuven, L Biostat, Louvain, Belgium

[5] Erasmus Univ, Inst Econ, Rotterdam, Netherlands

Source:

BMC Bioinformatics | 2013 / Volume 14

Keywords:

TOOL;

DOI:

10.1186/1471-2105-14-166

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Genome-wide association studies (GWAS) have become very popular for identifying genetic contributions to phenotypes. Millions of SNPs are tested for association with diseases and traits using linear or logistic regression models. This conceptually simple strategy runs into two computational issues: a very large number of tests, and very large genotype files (many gigabytes) that cannot be loaded directly into memory. One solution applied on a grand scale is cluster computing involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code.

Results: Computation time is reduced from 6 hours to 10-15 minutes. Using projections, our approach efficiently handles an essentially unlimited number of covariates. Data files in GWAS are vast, and reading them into computer memory becomes an important issue. However, much improvement can be made if the data are structured beforehand in a way that allows easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf. We also adapted the semi-parallel computations to logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures.

Conclusions: We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access.
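The abstract describes replacing per-SNP model fits with matrix operations and using projections to handle covariates. A minimal sketch of that idea for the linear case is given below, written in Python with NumPy rather than the authors' R code. The function name semi_parallel_linear and its arguments are hypothetical. The sketch assumes the covariate matrix X includes an intercept column and has full column rank, and it does not cover the logistic case or the block-wise file access via ff or ncdf.

```python
import numpy as np

def semi_parallel_linear(y, X, S):
    """Fit y ~ X + s_j for every SNP column s_j of S in one pass.

    y : (n,) phenotype
    X : (n, k) covariates including an intercept column (assumed full rank)
    S : (n, m) block of SNP genotypes or dosages
    Returns per-SNP effect estimates, standard errors and t statistics.
    """
    n, k = X.shape
    # Orthonormal basis for the covariate space; Q @ Q.T is the projection.
    Q, _ = np.linalg.qr(X)
    # Remove the covariate contribution from the phenotype and all SNPs at once.
    y_res = y - Q @ (Q.T @ y)
    S_res = S - Q @ (Q.T @ S)
    # Column-wise sums of squares and cross-products replace m separate fits.
    sts = np.einsum("ij,ij->j", S_res, S_res)
    sty = S_res.T @ y_res
    beta = sty / sts
    # Residual sum of squares of the full model, per SNP.
    rss = y_res @ y_res - beta * sty
    sigma2 = rss / (n - k - 1)
    se = np.sqrt(sigma2 / sts)
    return beta, se, beta / se

# Small simulated example (data are synthetic, for illustration only).
rng = np.random.default_rng(0)
n, m = 200, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
S = rng.integers(0, 3, size=(n, m)).astype(float)
y = rng.normal(size=n) + 0.3 * S[:, 0]
beta, se, t_stat = semi_parallel_linear(y, X, S)
```

Because the covariates are projected out once per block, the per-SNP work reduces to a few column-wise products, so the cost of adding covariates is paid in the projection rather than in every test. For the linear model, the estimates agree with fitting each SNP separately together with the covariates.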

Pages: 11
