Highly Accurate and Efficient Data-Driven Methods for Genotype Imputation

Cited by: 6
Authors
Choudhury, Olivia [1]
Chakrabarty, Ankush [2]
Emrich, Scott J. [1]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Harvard Univ, Harvard John A Paulson Sch Engn & Appl Sci, Cambridge, MA 02138 USA
Keywords
Genotype imputation; single nucleotide polymorphisms (SNPs); next-generation and high-throughput sequencing; machine learning; big data; GENOME-WIDE ASSOCIATION; LINKAGE DISEQUILIBRIUM; MISSING GENOTYPES; HAPLOTYPES; SEQUENCE; MACHINE; LOCI;
DOI
10.1109/TCBB.2017.2708701
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
High-throughput sequencing techniques have generated massive quantities of genotype data. Haplotype phasing has proven to be a useful and effective method for analyzing these data. However, the quality of phasing is undermined by missing information. Imputation provides an effective means of improving the underlying genotype information. For model organisms, imputation can rely on an available reference genotype panel and a physical or genetic map. For non-model organisms, which often do not have a genotype panel, it is important to design an imputation technique that does not rely on reference data. Here, we present the Accurate Data-Driven Imputation Technique (ADDIT), which is composed of two data-driven algorithms capable of handling data generated from model and non-model organisms. The non-model variant of ADDIT (referred to as ADDIT-NM) employs statistical inference methods to impute missing genotypes, whereas the model variant (referred to as ADDIT-M) leverages a supervised learning-based approach for imputation. We demonstrate, on model (human) and non-model (maize, apple, and grape) genotype data, that both variants of ADDIT are more accurate and faster, and require less memory, than leading state-of-the-art imputation tools.
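To make "imputation without reference data" concrete, the following is a minimal sketch of a generic reference-free baseline: each missing call is filled with the most frequent genotype observed at that SNP across the samples in the same data set. This is not ADDIT-NM or ADDIT-M, whose details the abstract does not give. The function name `impute_by_mode`, the 0/1/2 genotype coding, and the `MISSING = -1` marker are all assumptions made for illustration.

```python
from collections import Counter

MISSING = -1  # assumed marker for a missing genotype call


def impute_by_mode(genotypes):
    """Fill missing calls with the per-SNP most frequent observed genotype.

    genotypes: list of samples, each a list of calls coded 0/1/2
    (assumed coding) or MISSING. Returns a new matrix; the input
    is left unchanged.
    """
    if not genotypes:
        return []
    n_snps = len(genotypes[0])
    imputed = [row[:] for row in genotypes]  # copy so the input is not mutated
    for j in range(n_snps):
        observed = [row[j] for row in genotypes if row[j] != MISSING]
        if not observed:
            continue  # no observed call at this SNP; leave it missing
        mode = Counter(observed).most_common(1)[0][0]
        for row in imputed:
            if row[j] == MISSING:
                row[j] = mode
    return imputed


if __name__ == "__main__":
    data = [[0, 1, MISSING], [0, MISSING, 2], [MISSING, 1, 2]]
    print(impute_by_mode(data))
```

A baseline like this uses no reference panel and no physical or genetic map, which is the constraint the abstract describes for non-model organisms. It ignores linkage between neighbouring SNPs, so it is only a starting point for comparison.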
Pages: 1107-1116
Page count: 10