HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data

Times cited: 20
Authors
Wirawan, Adrianto [1]
Harris, Robert S. [2]
Liu, Yongchao [1]
Schmidt, Bertil [1]
Schroeder, Jan [3,4]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Inst Informat, D-55122 Mainz, Germany
[2] Penn State Univ, Dept Biol, State Coll, PA 16801 USA
[3] Walter & Eliza Hall Inst Med Res, Bioinformat Div, Melbourne, Vic, Australia
[4] Univ Melbourne, Dept Mol Med, Melbourne, Vic, Australia
Funding
National Health and Medical Research Council (Australia); Medical Research Council (UK);
Keywords
NGS error correction; Homopolymer-length error; 454; sequencing; Parallelization; LONG-READ ALIGNMENT; QUALITY; CUDA;
DOI
10.1186/1471-2105-15-131
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Background: Current-generation sequencing technologies produce low-cost, high-throughput reads. However, these reads are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in 454 sequencing reads.

Results: We present HECTOR, a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. In this algorithm, we investigate for the first time a novel homopolymer spectrum based approach to handling homopolymer insertions and deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector based on multiple sequence alignment, and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR achieves correction quality comparable to Coral, but runs 3.7x faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low.

Conclusion: Our homopolymer spectrum based approach is theoretically capable of processing homopolymer-length errors of arbitrary length, with linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector that is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
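
To illustrate the idea behind a homopolymer-based representation, here is a minimal Python sketch of one plausible building block: collapsing a read into runs of identical bases, then sliding a window over the collapsed sequence. It is an illustration under stated assumptions and not HECTOR's actual implementation; the function names `homopolymer_runs` and `collapsed_kmers` and the window size `k` are hypothetical and do not come from the abstract.

```python
from itertools import groupby


def homopolymer_runs(read):
    """Collapse a read into (base, run_length) pairs, one per homopolymer run."""
    return [(base, sum(1 for _ in group)) for base, group in groupby(read)]


def collapsed_kmers(read, k):
    """Yield windows of k consecutive homopolymer runs (hypothetical helper).

    Each item pairs the collapsed bases of the window with the tuple of
    observed run lengths, so two reads that differ only in run length
    produce the same collapsed bases.
    """
    runs = homopolymer_runs(read)
    for i in range(len(runs) - k + 1):
        window = runs[i:i + k]
        bases = "".join(base for base, _ in window)
        lengths = tuple(length for _, length in window)
        yield bases, lengths
```

In this representation, a read with a homopolymer-length error (for example `AACGGGT` in place of `AAACGGT`) yields the same collapsed bases and differs only in the recorded run lengths, which is what makes such errors easy to compare across reads.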
Pages: 13