On Combining Reference Data to Improve Imputation Accuracy

Cited by: 7
Authors
Chen, Jun [1]
Zhang, Ji-Gang [1]
Li, Jian [1]
Pei, Yu-Fang [1]
Deng, Hong-Wen [1, 2, 3]
Affiliations
[1] Tulane Univ, Sch Publ Hlth & Trop Med, Ctr Bioinformat & Genom, Dept Biostat & Bioinformat, New Orleans, LA 70118 USA
[2] Shanghai Univ Sci & Technol, Ctr Syst Biomed Sci, Shanghai 201800, Peoples R China
[3] Beijing Jiaotong Univ, Coll Life Sci & Bioengn, Beijing, Peoples R China
Source
PLOS ONE | 2013 / Vol. 8 / Issue 01
Funding
National Institutes of Health (USA);
Keywords
GENOME-WIDE ASSOCIATION; HAPLOTYPE RECONSTRUCTION; GENOTYPE IMPUTATION; INFERENCE; SEQUENCE;
DOI
10.1371/journal.pone.0055600
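Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code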
07 ; 0710 ; 09 ;
Abstract
Genotype imputation is an important tool in human genetics studies. It uses reference sets with known genotypes, together with prior knowledge of linkage disequilibrium and recombination rates, to infer untyped alleles of human genetic variants at low cost. The reference sets used by current imputation approaches are based on HapMap data and/or on recently available next-generation sequencing (NGS) data, such as data generated by the 1000 Genomes Project. However, because coverage and call rates differ among NGS data sets, integrating NGS data sets of different accuracy with previously available reference data for use as imputation references is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation, on both simulated and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, the three strategies are as follows: strategy 1 uses one NGS data set as the reference; strategy 2 imputes samples using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, selecting the samples based on the high-accuracy reference when overlap occurs; and strategy 3 combines multiple available data sets into a single reference after imputing them against each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding the application of imputation methods in next-generation association analyses.
Pages: 8