SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes

Cited by: 19
Authors
Chan, Carmen H. S. [1]
Octavia, Sophie [1]
Sintchenko, Vitali [2, 3]
Lan, Ruiting [1]
Affiliations
[1] Univ New South Wales, Sch Biotechnol & Biomol Sci, Sydney, NSW 2052, Australia
[2] Westmead Hosp, Inst Clin Pathol & Med Res, Ctr Infect Dis & Microbiol Publ Hlth, Westmead, NSW, Australia
[3] Univ Sydney, Marie Bashir Inst Infect Dis & Biosecur, Sydney, NSW 2006, Australia
Funding
Medical Research Council (UK)
Keywords
Next generation sequencing; Genome assembly; Single nucleotide polymorphisms; Reference free SNP discovery; ENTERICA SEROVAR TYPHIMURIUM; SEQUENCING DATA; SURVEILLANCE; ASSEMBLIES; FRAMEWORK; OUTBREAKS; ACCURACY; GENOTYPE; QUALITY;
DOI
10.1016/j.compbiolchem.2016.09.004
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
De novo assembly of bacterial genomes from next-generation sequencing (NGS) data allows reference-free discovery of single nucleotide polymorphisms (SNPs). However, substantial error rates in genomes assembled by this approach remain a major barrier to the reference-free analysis of genome variation in medically important bacteria. The aim of this report was to improve the quality of SNP identification in bacterial genomes without closely related references. We developed a bioinformatics pipeline (SnpFilt) that constructs an assembly using SPAdes and then removes unreliable regions based on the quality and coverage of re-aligned reads at neighbouring regions. The performance of the pipeline was compared against reference-based SNP calling for Illumina HiSeq, MiSeq and NextSeq reads from a range of bacterial pathogens including Salmonella, which is one of the most common causes of food-borne disease. The SnpFilt pipeline removed all false SNPs in all test NGS datasets consisting of paired-end Illumina reads. We also showed that reliable and complete SNP calls require at least 40-fold coverage. Analysis of bacterial isolates associated with epidemiologically confirmed outbreaks using the SnpFilt pipeline produced results consistent with previously published findings. The SnpFilt pipeline improves the quality of de novo assembly and the precision of SNP calling in bacterial genomes by removing regions of the assembly that may contain assembly errors. SnpFilt is available from https://github.com/LanLab/SnpFilt. (C) 2016 Elsevier Ltd. All rights reserved.
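The filtering idea in the abstract (discarding calls whose neighbouring region has weak read support) and the 40-fold coverage requirement can be illustrated with a minimal Python sketch. This is not SnpFilt's actual implementation: the function names, the `min_depth` threshold, and the `window` size are hypothetical values chosen only for illustration, and the sketch looks at depth alone, not base or mapping quality.

```python
from typing import List, Sequence


def filter_snps_by_neighbourhood(
    snp_positions: Sequence[int],
    depth: Sequence[int],
    min_depth: int = 10,  # hypothetical threshold, not taken from the paper
    window: int = 5,      # hypothetical half-window size in bases
) -> List[int]:
    """Keep SNPs (0-based positions) whose surrounding window has
    depth >= min_depth at every base."""
    kept = []
    for pos in snp_positions:
        start = max(0, pos - window)
        end = min(len(depth), pos + window + 1)
        # Reject the SNP if any neighbouring base is poorly covered.
        if all(d >= min_depth for d in depth[start:end]):
            kept.append(pos)
    return kept


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate fold coverage as total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Toy per-base depth profile for one contig: a 5-base dip in the middle.
depth = [50] * 20 + [2] * 5 + [50] * 20
candidate_snps = [5, 22, 27, 40]

print(filter_snps_by_neighbourhood(candidate_snps, depth))  # [5, 40]
print(fold_coverage(1_000_000, 250, 5_000_000) >= 40)       # True
```

In this toy profile the SNPs at positions 22 and 27 are dropped because their windows overlap the low-depth dip, while the coverage check shows how a dataset could be screened against the 40-fold figure before SNP calling.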
Pages: 178-184
Page count: 7