Pitfalls of merging GWAS data: lessons learned in the eMERGE network and quality control procedures to maintain high data quality

Cited by: 61
Authors
Zuvich, Rebecca L. [2]
Armstrong, Loren L. [3]
Bielinski, Suzette J. [4]
Bradford, Yuki [2]
Carlson, Christopher S. [5]
Crawford, Dana C. [2]
Crenshaw, Andrew T. [6]
de Andrade, Mariza [7]
Doheny, Kimberly F. [8]
Haines, Jonathan L. [2]
Hayes, M. Geoffrey [3]
Jarvik, Gail P. [9, 10]
Jiang, Lan [2]
Kullo, Iftikhar J. [11]
Li, Rongling [12]
Ling, Hua [8]
Manolio, Teri A. [12]
Matsumoto, Martha E. [7]
McCarty, Catherine A. [13]
McDavid, Andrew N. [5]
Mirel, Daniel B. [6]
Olson, Lana M. [2]
Paschall, Justin E. [14]
Pugh, Elizabeth W. [8]
Rasmussen, Luke V. [15]
Rasmussen-Torvik, Laura J. [16]
Turner, Stephen D. [2]
Wilke, Russell A. [17]
Ritchie, Marylyn D. [1]
Affiliations
[1] Penn State Univ, Huck Inst Life Sci, Ctr Syst Genom, Dept Biochem & Mol Biol, University Pk, PA 16802 USA
[2] Vanderbilt Univ, Dept Mol Physiol & Biophys, Ctr Human Genet Res, Nashville, TN 37232 USA
[3] Northwestern Univ, Div Endocrinol Metab & Mol Med, Feinberg Sch Med, Chicago, IL 60611 USA
[4] Mayo Clin, Dept Hlth Sci Res, Div Epidemiol, Rochester, MN USA
[5] Fred Hutchinson Canc Res Ctr, Seattle, WA 98104 USA
[6] Broad Inst, Genet Anal Platform & Program Med & Populat Genet, Cambridge, MA USA
[7] Mayo Clin, Div Biomed Stat & Informat, Dept Hlth Sci Res, Rochester, MN USA
[8] Johns Hopkins Univ, Ctr Inherited Dis Res, Baltimore, MD USA
[9] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[10] Univ Washington, Dept Med, Seattle, WA 98195 USA
[11] Mayo Clin, Div Cardiovasc Dis, Dept Med, Rochester, MN USA
[12] NHGRI, Off Populat Genom, NIH, Bethesda, MD 20892 USA
[13] Marshfield Clin Res Fdn, Ctr Human Genet, Marshfield, WI USA
[14] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20892 USA
[15] Marshfield Clin Res Fdn, Biomed Informat Res Ctr, Marshfield, WI USA
[16] Northwestern Univ, Dept Prevent Med, Feinberg Sch Med, Chicago, IL 60611 USA
[17] Vanderbilt Univ, Div Clin Pharmacol, Dept Med, Nashville, TN USA
Keywords
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets; ELECTRONIC MEDICAL-RECORDS; GENOME-WIDE ASSOCIATION; HUMAN-DISEASE; ARCHITECTURE; GENETICS; TOOL;
DOI
10.1002/gepi.20639
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of 14 phenotypes for extraction of study samples from each site's DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information in the database of Genotypes and Phenotypes. Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process. Genet. Epidemiol. 35:887-898, 2011. © 2011 Wiley Periodicals, Inc.
Pages: 887-898
Page count: 12