Impact of pre-imputation SNP-filtering on genotype imputation results

Cited by: 34
Authors
Roshyara, Nab Raj [1, 2]
Kirsten, Holger [1, 2, 3, 4]
Horn, Katrin [1, 2]
Ahnert, Peter [1, 2]
Scholz, Markus [1, 2]
Affiliations
[1] Univ Leipzig, Inst Med Informat Stat & Epidemiol, D-04107 Leipzig, Germany
[2] Univ Leipzig, LIFE Ctr, Leipzig Interdisciplinary Res Cluster Genet Facto, D-04103 Leipzig, Germany
[3] Fraunhofer Inst Cell Therapy & Immunol, Dept Cell Therapy, D-04103 Leipzig, Germany
[4] Univ Leipzig, Translat Ctr Regenerat Med, D-04103 Leipzig, Germany
Keywords
Genotype imputation; Pre-imputation filtering; SNP quality control; Genome-wide association analysis; SNP data; GENOME-WIDE ASSOCIATION; QUALITY-CONTROL; INFERENCE; GENETICS
DOI
10.1186/s12863-014-0088-5
Chinese Library Classification number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Background: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, the impact of initial SNP-data quality control on imputation results is still poorly understood. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE.
Results: We considered three scenarios: imputation of partially missing genotypes using an external reference panel, imputation of partially missing genotypes without an external reference panel, and imputation of completely untyped SNPs using an external reference panel. We first created various datasets by applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios using established and newly proposed measures of imputation quality. While the established measures assess the certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP filtering might be detrimental to imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality.
Conclusion: Even moderate filtering has a detrimental effect on imputation quality. Therefore, little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios, at the cost of increased computing time.
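The mask-impute-compare design described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: genotypes are coded 0/1/2 in a plain list of lists, masking is applied to individual genotype calls, the imputer is a naive per-SNP most-frequent-genotype fill standing in for MaCH or IMPUTE2, and plain concordance stands in for the paper's newly proposed agreement measures, which are not specified in this record. The function names are hypothetical.

```python
import random
from collections import Counter

def mask_genotypes(genotypes, fraction, rng):
    """Copy the matrix and set a random `fraction` of calls to None.

    Returns the masked copy and the list of masked (row, column) positions.
    """
    positions = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    masked = rng.sample(positions, round(fraction * len(positions)))
    copy = [list(row) for row in genotypes]
    for i, j in masked:
        copy[i][j] = None
    return copy, masked

def impute_mode(genotypes):
    """Placeholder imputer (NOT MaCH/IMPUTE2): fill each missing call with
    the most common observed genotype at that SNP, or 0 if none is observed."""
    filled = [list(row) for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        mode = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled

def concordance(truth, imputed, masked):
    """Share of masked calls whose imputed genotype equals the true one."""
    if not masked:
        return float("nan")
    hits = sum(truth[i][j] == imputed[i][j] for i, j in masked)
    return hits / len(masked)

# Toy run: 50 individuals, 30 SNPs, random genotypes, 10% of calls masked.
rng = random.Random(0)
truth = [[rng.choice([0, 1, 2]) for _ in range(30)] for _ in range(50)]
masked_data, masked_positions = mask_genotypes(truth, 0.10, rng)
imputed = impute_mode(masked_data)
print("concordance on masked calls:", concordance(truth, imputed, masked_positions))
```

Because the true genotypes at the masked positions are known, agreement can be scored directly. Repeating the run on differently filtered versions of the same dataset would then allow the kind of comparison between filtering scenarios that the abstract reports.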
Pages: 11