SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8
Authors
Zhang, Di [1]
Zhao, Linhai [1]
Li, Biao [1]
He, Zongxiao [1]
Wang, Gao T. [2]
Liu, Dajiang J. [3]
Leal, Suzanne M. [1]
Affiliations
[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA
[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA
Keywords
GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH
DOI
10.1016/j.ajhg.2017.05.017
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark, which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
Pages: 115-122
Page count: 8
Related papers
23 in total
  • [1] Guidelines for Large-Scale Sequence-Based Complex Trait Association Studies: Lessons Learned from the NHLBI Exome Sequencing Project
    Auer, Paul L.
    Reiner, Alex P.
    Wang, Gao
    Kang, Hyun Min
    Abecasis, Goncalo R.
    Altshuler, David
    Bamshad, Michael J.
    Nickerson, Deborah A.
    Tracy, Russell P.
    Rich, Stephen S.
    Leal, Suzanne M.
    [J]. AMERICAN JOURNAL OF HUMAN GENETICS, 2016, 99 (04) : 791 - 801
  • [2] Testing for Rare Variant Associations in the Presence of Missing Data
    Auer, Paul L.
    Wang, Gao
    Leal, Suzanne M.
    [J]. GENETIC EPIDEMIOLOGY, 2013, 37 (06) : 529 - 538
  • [3] A general framework for estimating the relative pathogenicity of human genetic variants
    Kircher, Martin
    Witten, Daniela M.
    Jain, Preti
    O'Roak, Brian J.
    Cooper, Gregory M.
    Shendure, Jay
    [J]. NATURE GENETICS, 2014, 46 (03) : 310 - +
  • [4] Optimal tests for rare variant effects in sequencing association studies
    Lee, Seunggeun
    Wu, Michael C.
    Lin, Xihong
    [J]. BIOSTATISTICS, 2012, 13 (04) : 762 - 775
  • [5] Analysis of protein-coding genetic variation in 60,706 humans
    Lek, Monkol
    Karczewski, Konrad J.
    Minikel, Eric V.
    Samocha, Kaitlin E.
    Banks, Eric
    Fennell, Timothy
    O'Donnell-Luria, Anne H.
    Ware, James S.
    Hill, Andrew J.
    Cummings, Beryl B.
    Tukiainen, Taru
    Birnbaum, Daniel P.
    Kosmicki, Jack A.
    Duncan, Laramie E.
    Estrada, Karol
    Zhao, Fengmei
    Zou, James
    Pierce-Hollman, Emma
    Berghout, Joanne
    Cooper, David N.
    Deflaux, Nicole
    DePristo, Mark
    Do, Ron
    Flannick, Jason
    Fromer, Menachem
    Gauthier, Laura
    Goldstein, Jackie
    Gupta, Namrata
    Howrigan, Daniel
    Kiezun, Adam
    Kurki, Mitja I.
    Moonshine, Ami Levy
    Natarajan, Pradeep
    Orozco, Lorena
    Peloso, Gina M.
    Poplin, Ryan
    Rivas, Manuel A.
    Ruano-Rubio, Valentin
    Rose, Samuel A.
    Ruderfer, Douglas M.
    Shakir, Khalid
    Stenson, Peter D.
    Stevens, Christine
    Thomas, Brett P.
    Tiao, Grace
    Tusie-Luna, Maria T.
    Weisburd, Ben
    Won, Hong-Hee
    Yu, Dongmei
    Altshuler, David M.
    [J]. NATURE, 2016, 536 (7616) : 285 - +
  • [6] Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data
    Li, Bingshan
    Leal, Suzanne M.
    [J]. AMERICAN JOURNAL OF HUMAN GENETICS, 2008, 83 (03) : 311 - 321
  • [7] Genotype Imputation
    Li, Yun
    Willer, Cristen
    Sanna, Serena
    Abecasis, Goncalo
    [J]. ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS, 2009, 10 : 387 - 406
  • [8] Lin DY, 2011, AM J HUM GENET, V89, P354, DOI 10.1016/j.ajhg.2011.07.015
  • [9] Meta-analysis of gene-level tests for rare variant association
    Liu, Dajiang J.
    Peloso, Gina M.
    Zhan, Xiaowei
    Holmen, Oddgeir L.
    Zawistowski, Matthew
    Feng, Shuang
    Nikpay, Majid
    Auer, Paul L.
    Goel, Anuj
    Zhang, He
    Peters, Ulrike
    Farrall, Martin
    Orho-Melander, Marju
    Kooperberg, Charles
    McPherson, Ruth
    Watkins, Hugh
    Willer, Cristen J.
    Hveem, Kristian
    Melander, Olle
    Kathiresan, Sekar
    Abecasis, Goncalo R.
    [J]. NATURE GENETICS, 2014, 46 (02) : 200 - +
  • [10] dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs
    Liu, Xiaoming
    Wu, Chunlei
    Li, Chang
    Boerwinkle, Eric
    [J]. HUMAN MUTATION, 2016, 37 (03) : 235 - 241