On Combining Reference Data to Improve Imputation Accuracy

Cited by: 7
Authors
Chen, Jun [1]
Zhang, Ji-Gang [1]
Li, Jian [1]
Pei, Yu-Fang [1]
Deng, Hong-Wen [1, 2, 3]
Affiliations
[1] Tulane Univ, Sch Publ Hlth & Trop Med, Ctr Bioinformat & Genom, Dept Biostat & Bioinformat, New Orleans, LA 70118 USA
[2] Shanghai Univ Sci & Technol, Ctr Syst Biomed Sci, Shanghai 201800, Peoples R China
[3] Beijing Jiaotong Univ, Coll Life Sci & Bioengn, Beijing, Peoples R China
Source
PLOS ONE | 2013 / Volume 8 / Issue 01
Funding
National Institutes of Health (USA)
Keywords
GENOME-WIDE ASSOCIATION; HAPLOTYPE RECONSTRUCTION; GENOTYPE IMPUTATION; INFERENCE; SEQUENCE;
DOI
10.1371/journal.pone.0055600
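Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09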
Abstract
Genotype imputation is an important tool in human genetics studies. It uses reference sets with known genotypes, together with prior knowledge of linkage disequilibrium and recombination rates, to infer untyped alleles for human genetic variations at low cost. The reference sets used by current imputation approaches are based on HapMap data and/or on recently available next-generation sequencing (NGS) data, such as the data generated by the 1000 Genomes Project. However, because coverage and call rates differ across NGS data sets, integrating NGS data sets of different accuracy with previously available reference data for use as imputation references is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation, on both simulated and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, the three strategies are as follows: strategy 1 uses a single NGS data set as the reference; strategy 2 imputes samples using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, selecting the samples based on the high-accuracy reference where overlap occurs; and strategy 3 combines multiple available data sets into a single reference after imputing them against each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding the application of imputation methods in next-generation association analyses.
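The combining step of strategy 2 can be illustrated with a minimal sketch. This is only one reading of the abstract, not the authors' implementation: it assumes each imputation run yields per-SNP allele dosages, that each reference carries a single accuracy score, and that an overlapping SNP is taken from the run whose reference has the higher score. The function name, the data layout, the SNP identifiers and the accuracy values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one imputation run maps a SNP ID to per-sample
# allele dosages (values between 0 and 2).
Imputed = Dict[str, List[float]]


def combine_by_reference_accuracy(
    runs: List[Tuple[float, Imputed]],
) -> Imputed:
    """Merge imputation runs made against independent references.

    Each run is (reference_accuracy, imputed_dosages). A SNP imputed in
    only one run is kept as is. A SNP imputed in several runs is taken
    from the run with the highest reference accuracy (assumed rule);
    on a tie, the run listed first wins.
    """
    merged: Imputed = {}
    best_accuracy: Dict[str, float] = {}
    for accuracy, imputed in runs:
        for snp, dosages in imputed.items():
            if snp not in merged or accuracy > best_accuracy[snp]:
                merged[snp] = dosages
                best_accuracy[snp] = accuracy
    return merged


# Illustrative values only; they are not taken from the paper.
high = (0.99, {"rs1": [0.0, 1.0], "rs2": [2.0, 1.0]})
low = (0.90, {"rs2": [1.8, 0.9], "rs3": [0.1, 0.0]})
combined = combine_by_reference_accuracy([low, high])
```

In this toy input, `rs2` is present in both runs and is resolved in favour of the higher-accuracy reference, while `rs1` and `rs3` are carried over unchanged. Strategy 3 differs in that the references themselves are merged before the study samples are imputed.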
Pages: 8