Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines

Times cited: 77
Authors
Bush, Stephen J. [1, 2, 3]
Foster, Dona [1, 3, 4]
Eyre, David W. [1]
Clark, Emily L. [5, 6]
De Maio, Nicola [7]
Shaw, Liam P. [1]
Stoesser, Nicole [1]
Peto, Tim E. A. [1, 2, 3, 4]
Crook, Derrick W. [1, 2, 3, 4]
Walker, A. Sarah [1, 2, 3, 4]
Affiliations
[1] Univ Oxford, John Radcliffe Hosp, Nuffield Dept Med, Oxford OX3 9DU, England
[2] Univ Oxford, Publ Hlth England, Natl Inst Hlth Res, Hlth Res Protect Unit Healthcare Associated Infec, Oxford, England
[3] John Radcliffe Hosp, Oxford OX3 9DU, England
[4] Oxford Biomed Res Ctr, Natl Inst Hlth Res, Oxford, England
[5] Univ Edinburgh, Roslin Inst, Easter Bush Campus, Roslin EH25 9RG, Midlothian, Scotland
[6] Univ Edinburgh, Royal Dick Sch Vet Studies, Easter Bush Campus, Roslin EH25 9RG, Midlothian, Scotland
[7] EBI, European Mol Biol Lab, Wellcome Genome Campus, Hinxton CB10 1SH, Cambs, England
Source
GIGASCIENCE | 2020 / Volume 9 / Issue 02
Funding
Biotechnology and Biological Sciences Research Council (UK);
Keywords
SNP calling; variant calling; evaluation; benchmarking; bacteria; CLOSTRIDIUM-DIFFICILE; READ ALIGNMENT; SEQUENCE; ALGORITHMS; FRAMEWORK; OUTBREAKS; VARIANTS; LINES; GUIDE; SET;
DOI
10.1093/gigascience/giaa007
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Background: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella.

Results: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis.

Conclusions: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
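The abstract reports pipeline performance as sensitivity and precision, and relates both to the Mash distance between reads and reference. As a rough illustration only, the sketch below shows how these quantities are conventionally computed. It is not the authors' evaluation code: the function names, the representation of a SNP as a (contig, position, alternate base) tuple, and the example data are all assumptions made here. The distance formula is the standard Mash estimate from a Jaccard index j of k-mer sketches, D = -(1/k) ln(2j / (1 + j)), with k = 21 as Mash's default k-mer size.

```python
import math


def evaluate_calls(truth, called):
    """Compare a called SNP set against a truth set.

    Both arguments are sets of hypothetical (contig, position, alt_base)
    tuples; this representation is an assumption of this sketch.
    """
    tp = len(truth & called)   # true positives: called and real
    fp = len(called - truth)   # false positives: called but not real
    fn = len(truth - called)   # false negatives: real but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + sensitivity
    f_score = 2 * precision * sensitivity / total if total else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f_score": f_score}


def mash_distance(jaccard, k=21):
    """Mash distance estimate from a k-mer Jaccard index (0 to 1)."""
    if jaccard <= 0:
        return 1.0  # no shared k-mers: maximal distance
    if jaccard >= 1:
        return 0.0  # identical sketches
    return min(1.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


# Made-up example: four true SNPs, three calls, two of them correct.
truth = {("chr", 10, "A"), ("chr", 20, "C"), ("chr", 30, "G"), ("chr", 40, "T")}
called = {("chr", 10, "A"), ("chr", 20, "C"), ("chr", 50, "G")}
result = evaluate_calls(truth, called)
print(result)
print(mash_distance(0.5))
```

In this toy case, two of three calls are correct and two of four true SNPs are recovered, so precision exceeds sensitivity. A divergent reference would be expected, per the abstract, to lower both values as the Mash distance grows.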
Pages: 21