Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines

Cited by: 33
Authors
Frampton, Matthew [1]
Houlston, Richard [1]
Affiliations
[1] Inst Canc Res, Div Genet & Epidemiol, Sutton, Surrey, England
Source
PLOS ONE | 2012 / Volume 7 / Issue 11
Keywords
FRAMEWORK; SIMULATOR; VARIANTS; FORMAT; GENOME
DOI
10.1371/journal.pone.0049110
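Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract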
Pipelines for the analysis of Next-Generation Sequencing (NGS) data are generally composed of a set of different publicly available software tools, configured together to map short reads to a genome and call variants. The fidelity of pipelines is variable. We have developed ArtificialFastqGenerator, which takes a reference genome sequence as input and outputs artificial paired-end FASTQ files containing Phred quality scores. Since these artificial FASTQs are derived from the reference genome, they provide a gold standard for read alignment and variant calling, thereby enabling the performance of any NGS pipeline to be evaluated. The user can customise DNA template/read length, the modelling of coverage based on GC content, whether to use real Phred base quality scores taken from existing FASTQ files, and whether to simulate sequencing errors. Detailed coverage and error summary statistics are output. Here we describe ArtificialFastqGenerator and illustrate its implementation in evaluating a typical bespoke NGS analysis pipeline under different experimental conditions. ArtificialFastqGenerator was released in January 2012. Source code, example files and binaries are freely available under the terms of the GNU General Public License v3.0 from https://sourceforge.net/projects/artfastqgen/.
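The general idea described in the abstract, deriving paired-end reads with Phred quality scores from a reference sequence, can be illustrated with a minimal Python sketch. This is not the ArtificialFastqGenerator implementation: the function names (`simulate_pairs`, `add_errors`, and so on), the fixed tiling step, the constant quality score and the uniform substitution model are all assumptions made for illustration, and GC-dependent coverage is not modelled.

```python
import random

# Hypothetical helper names; none of these come from ArtificialFastqGenerator.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def phred_to_char(q):
    """Encode a Phred score as a Sanger FASTQ character (ASCII = Q + 33)."""
    return chr(q + 33)


def add_errors(read, quals, rng):
    """Substitute each base with probability 10^(-Q/10) (assumed error model)."""
    out = []
    for base, q in zip(read, quals):
        if rng.random() < 10 ** (-q / 10):
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_pairs(reference, template_len, read_len, step,
                   phred=30, errors=False, seed=0):
    """Tile templates along the reference and emit (read1, read2) FASTQ records.

    Read 1 is the start of the template; read 2 is the start of the
    template's reverse complement, as in paired-end sequencing.
    """
    rng = random.Random(seed)
    quals = [phred] * read_len
    qual_str = "".join(phred_to_char(q) for q in quals)
    pairs = []
    for start in range(0, len(reference) - template_len + 1, step):
        template = reference[start:start + template_len]
        r1 = template[:read_len]
        r2 = reverse_complement(template)[:read_len]
        if errors:
            r1 = add_errors(r1, quals, rng)
            r2 = add_errors(r2, quals, rng)
        # The read name records the true origin, which is what makes the
        # reference usable as a gold standard for alignment.
        name = f"@sim_{start}_{start + template_len}"
        pairs.append((f"{name}/1\n{r1}\n+\n{qual_str}\n",
                      f"{name}/2\n{r2}\n+\n{qual_str}\n"))
    return pairs


if __name__ == "__main__":
    for rec1, rec2 in simulate_pairs("AAAACCCCGGGG", 8, 4, 4):
        print(rec1, rec2, sep="", end="")
```

Because each read name carries the coordinates of the template it was cut from, an aligner's reported positions can be compared directly against the known origin, which is the evaluation principle the abstract describes.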
Pages: 5