MultiGeMS: detection of SNVs from multiple samples using model selection on high-throughput sequencing data

Cited by: 3
Authors
Murillo, Gabriel H. [1]
You, Na [2]
Su, Xiaoquan [3]
Cui, Wei [1]
Reilly, Muredach P. [4]
Li, Mingyao [5]
Ning, Kang [6]
Cui, Xinping [1, 7]
Affiliations
[1] Univ Calif Riverside, Dept Stat, Riverside, CA 92521 USA
[2] Sun Yat Sen Univ, Sch Math & Computat Sci, Dept Stat Sci, Guangzhou 510275, Guangdong, Peoples R China
[3] Chinese Acad Sci, Qingdao Inst BioEnergy & Bioproc Technol, Qingdao 266101, Shandong, Peoples R China
[4] Univ Penn, Perelman Sch Med, Cardiovasc Inst, Philadelphia, PA 19104 USA
[5] Univ Penn, Perelman Sch Med, Dept Biostat & Epidemiol, Philadelphia, PA 19104 USA
[6] Huazhong Univ Sci & Technol, Coll Life Sci & Technol, Key Lab Mol Biophys, Minist Educ, Wuhan 430074, Hubei, Peoples R China
[7] Univ Calif Riverside, Inst Integrat Genome Biol, Ctr Plant Cell Biol, Riverside, CA 92521 USA
Funding
US National Science Foundation
Keywords
DISCOVERY; FRAMEWORK
DOI
10.1093/bioinformatics/btv753
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Single nucleotide variant (SNV) detection procedures are being utilized as never before to analyze the recent abundance of high-throughput DNA sequencing data, both on single and multiple sample datasets. Building on previously published work with the single sample SNV caller genotype model selection (GeMS), a multiple sample version of GeMS (MultiGeMS) is introduced. Unlike other popular multiple sample SNV callers, the MultiGeMS statistical model accounts for enzymatic substitution sequencing errors. It also addresses the multiple testing problem endemic to multiple sample SNV calling and utilizes high performance computing (HPC) techniques.

Results: A simulation study demonstrates that MultiGeMS ranks highest in precision among a selection of popular multiple sample SNV callers, while showing exceptional recall in calling common SNVs. Further, both simulation studies and real data analyses indicate that MultiGeMS is robust to low-quality data. We also demonstrate that accounting for enzymatic substitution sequencing errors not only improves SNV call precision at low mapping quality regions, but also improves recall at reference allele-dominated sites with high mapping quality.
Pages: 1486-1492
Page count: 7