HALC: High throughput algorithm for long read error correction

Cited by: 37
Authors
Bao, Ergude [1, 2]
Lan, Lingxiao [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Software Engn, 3 Shangyuan Residence, Beijing 100044, Peoples R China
[2] Univ Calif Riverside, Dept Bot & Plant Sci, 900 Univ Ave, Riverside, CA 92521 USA
Source
BMC BIOINFORMATICS | 2017 / Volume 18
Funding
U.S. National Science Foundation;
Keywords
PacBio long reads; Error correction; Throughput; MOLECULE SEQUENCING READS; BASIC LOCAL ALIGNMENT; RNA-SEQ DATA; GENOME ASSEMBLIES; TOOL; ACCURATE;
DOI
10.1186/s12859-017-1610-3
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.
Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.
Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
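The long read support based validation idea in the abstract can be illustrated with a toy sketch. This is not the HALC implementation: the data layout (each read as a list of regions, each region as a list of candidate contig-region labels), the function names `transition_support` and `best_path`, and the scoring rule (count how many reads are compatible with each pair of adjacent contig regions, then pick the best-supported chain by dynamic programming) are all assumptions made for illustration. It also assumes every region has at least one candidate, as the low identity requirement is meant to ensure.

```python
from collections import Counter
from itertools import product


def transition_support(reads):
    """Count, for each pair of contig regions (a, b), how many reads list
    a as a candidate for one region and b as a candidate for the next."""
    support = Counter()
    for candidates in reads:
        for left, right in zip(candidates, candidates[1:]):
            # set() so one read contributes at most once per pair
            for pair in set(product(left, right)):
                support[pair] += 1
    return support


def best_path(candidates, support):
    """Pick one candidate per region of a single read so that the summed
    support of consecutive picks is maximal (hypothetical scoring rule)."""
    if not candidates:
        return []
    # score[c] = (best total support of a chain ending at c, that chain)
    score = {c: (0, [c]) for c in candidates[0]}
    for region in candidates[1:]:
        new_score = {}
        for c in region:
            new_score[c] = max(
                ((total + support[(prev, c)], path + [c])
                 for prev, (total, path) in score.items()),
                key=lambda item: item[0],
            )
        score = new_score
    return max(score.values(), key=lambda item: item[0])[1]


# Toy data: "A*" labels stand for true genome regions, "R*" for their repeats.
reads = [
    [["A1", "R1"], ["A2"], ["A3", "R3"]],  # ambiguous read to resolve
    [["A1"], ["A2"], ["A3"]],
    [["A1"], ["A2"], ["A3"]],
    [["R1"], ["R2"], ["R3"]],
]
chosen = best_path(reads[0], transition_support(reads))
print(chosen)  # ['A1', 'A2', 'A3']
```

In this toy data the ambiguous first read is resolved to the chain that the two unambiguous reads also follow, which is the intuition behind letting other long reads' alignments vote on the most plausible alignment.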
Pages: 12