Evaluation of variant identification methods for whole genome sequencing data in dairy cattle

Cited by: 38
Authors
Baes, Christine F. [1, 2]
Dolezal, Marlies A. [3, 4]
Koltes, James E. [5 ]
Bapst, Beat [2 ]
Fritz-Waters, Eric [5 ]
Jansen, Sandra [6 ]
Flury, Christine [1 ]
Signer-Hasler, Heidi [1 ]
Stricker, Christian [7 ]
Fernando, Rohan [5 ]
Fries, Ruedi [6 ]
Moll, Juerg [2 ]
Garrick, Dorian J. [5 ]
Reecy, James M. [5 ]
Gredler, Birgit [2 ]
Affiliations
[1] Bern Univ Appl Sci, Sch Agr Forest & Food Sci HAFL, CH-3052 Zollikofen, Switzerland
[2] Qualitas AG, CH-6300 Zug, Switzerland
[3] Univ Milan, Dept VESPA, I-20133 Milan, Italy
[4] Univ Vet Med Vienna, A-1210 Vienna, Austria
[5] Iowa State Univ, Dept Anim Sci, Ames, IA 50011 USA
[6] Tech Univ Munich, D-85354 Freising Weihenstephan, Germany
[7] Agn Genet GmbH, CH-7260 Davos, Switzerland
Keywords
Next-generation sequencing analysis; Single nucleotide variant identification; Pipeline; COMPLEX TRAITS; POPULATION; FORMAT; FRAMEWORK; ALIGNMENT; GENOTYPE;
DOI
10.1186/1471-2164-15-948
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.
Results: The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio.
Conclusions: Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.
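The quality measures named in the abstract (transition/transversion ratio and concordance with array genotypes) can be illustrated with a minimal Python sketch. This is not the authors' pipeline code. The function names, the input layout (tuples of reference and alternate alleles, dictionaries keyed by chromosome and position), and the exact concordance definitions are assumptions made for illustration only.

```python
# Illustrative sketch only; names and data layout are assumptions,
# not taken from the paper's pipeline.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def titv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) alleles."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a simple single-base substitution.
        if ref == alt or ref not in "ACGT" or alt not in "ACGT":
            continue
        if len(ref) != 1 or len(alt) != 1:
            continue
        same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
        if same_class:
            ti += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv


def site_concordance(seq_calls, chip_calls):
    """Fraction of array (chip) sites that were also called from sequence data.

    Assumed definition: shared sites / chip sites.
    """
    if not chip_calls:
        raise ValueError("no chip sites supplied")
    shared = seq_calls.keys() & chip_calls.keys()
    return len(shared) / len(chip_calls)


def genotype_concordance(seq_calls, chip_calls):
    """Fraction of shared sites where both sources report the same unphased genotype."""
    shared = seq_calls.keys() & chip_calls.keys()
    if not shared:
        raise ValueError("no shared sites")
    matches = sum(
        sorted(seq_calls[site]) == sorted(chip_calls[site]) for site in shared
    )
    return matches / len(shared)
```

Genotypes are compared as unordered allele pairs, so ("A", "G") and ("G", "A") count as a match. Real pipelines would normally read these values from VCF files and array reports rather than from in-memory dictionaries.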
Pages: 18