Software for pre-processing Illumina next-generation sequencing short read sequences

Cited by: 187

Authors:

Chen, Chuming [1]

Khaleel, Sari S. [2]

Huang, Hongzhan [1]

Wu, Cathy H. [1]

Affiliations:

[1] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE 19711 USA

[2] Dartmouth Coll, Geisel Sch Med, Hanover, NH 03755 USA

Source:

SOURCE CODE FOR BIOLOGY AND MEDICINE | 2014 / Vol. 9 / Issue 01

Funding:

U.S. National Institutes of Health (NIH);

Keywords:

Next-generation sequencing; Illumina; Trimming; De novo assembly; Reference-based assembly; Perl;

DOI:

10.1186/1751-0473-9-8

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification code:

07 ; 0710 ; 09 ;

Abstract:

Background: When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provides flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.

Methods: We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157:H7.

Results: Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences, with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness.

Conclusions: Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies.
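The abstract recommends combining quality-score-based base trimming with read filtering. The following Python sketch illustrates that general idea only; it is not ngsShoRT's implementation (which is written in Perl), and the function names, the threshold values, and the choice of 3'-end trimming followed by a mean-quality filter are all assumptions made for illustration. It also assumes Phred+33 quality encoding; older Illumina pipelines used Phred+64, so the offset is left as a parameter.

```python
from typing import Iterable

Read = tuple[str, str]  # (bases, quality string); hypothetical representation


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Read:
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual: str, min_mean_q: float = 20.0, min_len: int = 4,
                  offset: int = 33) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    scores = phred_scores(qual, offset)
    # The length check runs first, so an empty read never reaches the division.
    return len(scores) >= min_len and sum(scores) / len(scores) >= min_mean_q


def preprocess(reads: Iterable[Read], min_q: int = 20, min_mean_q: float = 20.0,
               min_len: int = 4) -> list[Read]:
    """Trim each read at the 3' end, then drop reads that fail the filter."""
    kept = []
    for seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_q)
        if passes_filter(qual, min_mean_q, min_len):
            kept.append((seq, qual))
    return kept
```

Trimming before filtering means a read with a good 5' portion and a poor 3' tail can be rescued rather than discarded outright; the thresholds here are illustrative toy values, not recommendations from the paper.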


Pages: 11
