An efficient simulator of 454 data using configurable statistical models

Cited by: 24
Authors
Lysholm F. [1]
Andersson B. [2]
Persson B. [1,2]
Affiliations
[1] IFM Bioinformatics and SeRC (Swedish E-Science Research Centre), Linköping University
[2] Department of Cell and Molecular Biology, Science for Life Laboratory, Karolinska Institutet
Keywords
Text Editor; Multiple Thread; Bioinformatic Application; Generation Sequencing Technique; Simulated Read
DOI
10.1186/1756-0500-4-449
Abstract
Background: Roche 454 is one of the major second-generation sequencing platforms. The particular characteristics of 454 sequence data pose new challenges for bioinformatic analyses, e.g. assembly and alignment search algorithms. Simulating these data is therefore useful for assessing how bioinformatic applications and algorithms handle 454 data. Findings: We developed a new application named 454sim that simulates 454 data with high speed and accuracy. The program supports multithreading and is available as C++ source code or as pre-compiled binaries. 454sim simulates sequence reads using a set of statistical models for each chemistry. It simulates recorded peak intensities and peak quality deterioration, and it calculates quality values. All three generations of the Roche 454 chemistry ('GS20', 'GS FLX' and 'Titanium') are supported, and their models are defined in external text files for easy access and tweaking. Conclusions: We present a new platform-independent application named 454sim. 454sim is generally 200 times faster than previous programs, and it allows simple adjustment of the statistical models. These improvements make it possible to carry out more complex and rigorous algorithm evaluations on a reasonable time scale. © 2011 Lysholm et al; licensee BioMed Central Ltd.
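The abstract describes the simulated quantities only in general terms. As a rough illustration of what simulating peak intensities with deteriorating quality can look like, the C++ sketch below draws one flow value per nucleotide flow from a normal distribution centred on the homopolymer length, with a spread that grows with homopolymer length and flow number. It is not the 454sim implementation: the `FlowModel` struct, its fields, the functions `simulate_flowgram` and `call_bases`, the choice of a normal distribution, and the default flow order `TACG` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical model parameters for illustration only; these are NOT the
// values or the file format used by 454sim.
struct FlowModel {
    double noise_sd;        // spread of the signal for a flow with no incorporation
    double sd_per_base;     // additional spread per incorporated base
    double decay_per_flow;  // additional spread per flow (quality deterioration)
};

// Simulate one flow value per nucleotide flow until the read is consumed.
std::vector<double> simulate_flowgram(const std::string& read,
                                      const FlowModel& model,
                                      std::mt19937& rng,
                                      const std::string& flow_order = "TACG") {
    assert(!flow_order.empty());
    assert(model.noise_sd > 0.0);  // std::normal_distribution needs stddev > 0

    std::vector<double> flows;
    std::size_t pos = 0;           // next unread base
    std::size_t flow = 0;          // index of the current flow
    std::size_t empty_in_row = 0;  // consecutive flows without incorporation

    while (pos < read.size()) {
        const char nuc = flow_order[flow % flow_order.size()];

        // Count the homopolymer run that this flow would incorporate.
        std::size_t n = 0;
        while (pos < read.size() && read[pos] == nuc) { ++n; ++pos; }

        // Spread grows with homopolymer length and with flow number.
        const double sd = model.noise_sd
                        + model.sd_per_base * static_cast<double>(n)
                        + model.decay_per_flow * static_cast<double>(flow);
        std::normal_distribution<double> signal(static_cast<double>(n), sd);
        const double value = signal(rng);
        flows.push_back(value < 0.0 ? 0.0 : value);  // intensities are non-negative
        ++flow;

        // A full cycle without incorporation means the next base is not in
        // the flow order (e.g. 'N'); skip it so the loop terminates.
        empty_in_row = (n == 0) ? empty_in_row + 1 : 0;
        if (empty_in_row == flow_order.size()) { ++pos; empty_in_row = 0; }
    }
    return flows;
}

// Naive base calling: round each flow value to the nearest homopolymer length.
std::string call_bases(const std::vector<double>& flows,
                       const std::string& flow_order = "TACG") {
    std::string bases;
    for (std::size_t i = 0; i < flows.size(); ++i) {
        const long n = std::lround(flows[i]);
        if (n > 0) bases.append(static_cast<std::size_t>(n),
                                flow_order[i % flow_order.size()]);
    }
    return bases;
}
```

With a seeded `std::mt19937` (the standard library's Mersenne Twister generator), calling `simulate_flowgram` on a read and then `call_bases` on the result gives a simulated read; raising `sd_per_base` or `decay_per_flow` makes homopolymer over- and under-calls more frequent, which is the kind of error a simulator of this platform needs to reproduce.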