GemSIM: general, error-model based simulator of next-generation sequencing data

Cited by: 114
Authors
McElroy, Kerensa E. [1, 2, 3]
Luciani, Fabio [3]
Thomas, Torsten [1, 2]
Affiliations
[1] UNSW, Ctr Marine Bioinnovat, Sydney, NSW 2052, Australia
[2] UNSW, Sch Biotechnol & Biomol Sci, Sydney, NSW 2052, Australia
[3] Univ New S Wales, Sch Med Sci, Inflammat & Infect Res Grp, Sydney, NSW 2052, Australia
Funding
Medical Research Council (UK); National Health and Medical Research Council (Australia)
Keywords
QUALITY; ACCURACY; FORMAT;
DOI
10.1186/1471-2164-13-74
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.
Results: We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing and contrasting the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than in Illumina data, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs.
Conclusions: Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
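To make the idea of an empirically derived, sequence-context based error model more concrete, the following is a minimal sketch and not GemSIM's actual implementation. It assumes substitution errors only (no indels, quality scores, or fragment lengths) and assumes reference/read pairs that are already aligned and of equal length. The function names `build_error_model` and `simulate_read`, and the `context_len` parameter, are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict

def build_error_model(aligned_pairs, context_len=2):
    """Tally observed read bases, keyed by read position, the preceding
    reference bases (the sequence context), and the true reference base.

    aligned_pairs: iterable of (reference, read) strings of equal length.
    This keying scheme is an assumption made for illustration.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ref, read in aligned_pairs:
        for pos, (ref_base, read_base) in enumerate(zip(ref, read)):
            context = ref[max(0, pos - context_len):pos]
            counts[(pos, context, ref_base)][read_base] += 1
    # Freeze into plain dicts so lookups never create new keys.
    return {key: dict(obs) for key, obs in counts.items()}

def simulate_read(ref, model, rng, context_len=2):
    """Emit a read from `ref`, sampling each base from the empirical
    distribution recorded for its (position, context, base) key."""
    out = []
    for pos, ref_base in enumerate(ref):
        context = ref[max(0, pos - context_len):pos]
        observed = model.get((pos, context, ref_base))
        if not observed:
            # Context never seen in training data: emit the base unchanged.
            out.append(ref_base)
            continue
        bases = list(observed)
        weights = [observed[b] for b in bases]
        out.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(out)

# Toy training data: the second pair carries a T->A substitution at position 3.
training = [("ACGTAC", "ACGTAC"), ("ACGTAC", "ACGAAC")]
model = build_error_model(training)
print(simulate_read("ACGTAC", model, random.Random(0)))
```

In this toy model, position 3 after the context `CG` yields `T` or `A` with equal empirical weight, while every other position is reproduced exactly. A model built from a real sequencing run would be trained on many aligned reads, so the sampled error rate would reflect that run's own profile.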
Pages: 9