NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

Cited by: 0
Authors
Wang, Rongshu [1]
Chen, Jianhua [1]
Affiliations
[1] Yunnan Univ, Informat Sch, Dept Elect Engn, Kunming, Yunnan, Peoples R China
Source
BMC GENOMICS | 2024 / Vol. 25 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
Long read; Hybrid error correction; Neural machine translation; Natural language processing;
DOI
10.1186/s12864-024-10446-4
Chinese Library Classification number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Single-pass long reads generated by third-generation sequencing technology have a high error rate, while circular consensus sequencing (CCS) produces shorter reads. It is therefore effective to reduce the error rate of long reads algorithmically, using homologous high-accuracy, low-cost short reads from next-generation sequencing (NGS) technology.
Methods: This work proposes NmTHC, a hybrid error correction method based on a generative neural machine translation model. The model automatically captures discrepancies within the aligned regions of long and short reads, as well as the contextual relationships within the long reads themselves, and uses both for error correction. Like a natural language sequence, a long read can be regarded as a special "genetic language" and processed with generative neural networks. The algorithm builds a sequence-to-sequence (seq2seq) framework with a Recurrent Neural Network (RNN) as the core layer. The long reads before and after correction are treated as sentences in the source and target languages of a translation task, and the alignment information between long reads and short reads is used to create the training corpus. The trained model is then used to predict the corrected long read.
Results: NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, PacBio and Nanopore. The experimental evaluation shows that, on the six benchmark datasets, NmTHC aligns more bases to the reference genome without segmenting any reads, indicating that it improves alignment identity without sacrificing the length advantage of long reads.
Conclusion: NmTHC adopts a generative Neural Machine Translation (NMT) model to recast hybrid error correction as a machine translation problem, offering a novel perspective on long-read error correction based on ideas from Natural Language Processing (NLP). Notably, the proposed method is independent of the sequencing technology and produces more precise reads.
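The corpus idea described in the Methods can be illustrated with a minimal sketch. It shows how one long read before and after correction might be turned into a source/target "sentence" pair of integer token ids for a seq2seq model. The abstract does not state how NmTHC tokenizes reads, so the non-overlapping k-mer tokenization, the special tokens, and every function name here (`kmer_tokens`, `build_vocab`, `encode`) are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical special tokens; the paper's actual vocabulary is not given in the abstract.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def kmer_tokens(read: str, k: int = 3) -> List[str]:
    """Split a read into non-overlapping k-mers ("words"); the last one may be shorter."""
    return [read[i:i + k] for i in range(0, len(read), k)]


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every token, in order of first appearance."""
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    for sentence in sentences:
        for tok in sentence:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids and wrap the sentence in start/end markers."""
    body = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return [vocab["<sos>"]] + body + [vocab["<eos>"]]


# Toy example (made-up sequences): an erroneous long read and its corrected form,
# e.g. as it could be derived from short-read alignments.
raw_read = "ACGTTAGC"        # source "sentence" (before correction)
corrected_read = "ACGTAAGC"  # target "sentence" (after correction)

source = kmer_tokens(raw_read)
target = kmer_tokens(corrected_read)
vocab = build_vocab([source, target])

source_ids = encode(source, vocab)
target_ids = encode(target, vocab)
```

Pairs such as `(source_ids, target_ids)` are what an RNN encoder-decoder would consume during training; the choice of `k` trades vocabulary size against sentence length.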
Pages: 21