NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

Cited by: 0

Authors:

Wang, Rongshu [1]

Chen, Jianhua [1]

Affiliations:

[1] Yunnan Univ, Informat Sch, Dept Elect Engn, Kunming, Yunnan, Peoples R China

Source:

BMC GENOMICS | 2024 / Volume 25 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Long read; Hybrid error correction; Neural machine translation; Natural language processing;

DOI:

10.1186/s12864-024-10446-4

Chinese Library Classification number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification codes:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

Background: Single-pass long reads generated by third-generation sequencing technology exhibit a high error rate. Circular consensus sequencing (CCS) lowers the error rate but produces shorter reads. It is therefore effective to manage the error rate of long reads algorithmically, with the help of homologous, high-precision, low-cost short reads from Next Generation Sequencing (NGS) technology.

Methods: In this work, a hybrid error correction method (NmTHC) based on a generative neural machine translation model is proposed. It automatically captures discrepancies within the aligned regions of long reads and short reads, as well as the contextual relationships within the long reads themselves, and uses both for error correction. Akin to a natural language sequence, a long read can be regarded as a special "genetic language" and processed with the ideas of generative neural networks. The algorithm builds a sequence-to-sequence (seq2seq) framework with a Recurrent Neural Network (RNN) as the core layer. The long reads before and after correction are regarded as sentences in the source and target languages of translation, and the alignment information of long reads with short reads is used to create the special corpus for training. The trained model can then be used to predict the corrected long read.

Results: NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, PacBio and Nanopore. The experimental evaluation shows that, on the six benchmark datasets, NmTHC aligns more bases with the reference genome without segmenting the reads, indicating that it enhances alignment identity without sacrificing the length advantage of long reads.

Conclusion: NmTHC adopts the generative Neural Machine Translation (NMT) model to transform the hybrid error correction task into a machine translation problem, and provides a novel perspective for solving long-read error correction problems with the ideas of Natural Language Processing (NLP). Notably, the proposed methodology is sequencing-technology-independent and can produce more precise reads.
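
The idea of treating uncorrected and corrected long reads as source and target sentences can be illustrated with a small sketch. The abstract does not say how reads are tokenized or how the corpus is stored, so the overlapping k-mer tokenization, the function names `kmer_tokens` and `build_parallel_corpus`, and the toy read pair below are all assumptions made for illustration and are not taken from NmTHC itself.

```python
from typing import List, Tuple

def kmer_tokens(read: str, k: int = 3) -> List[str]:
    """Split a read into overlapping k-mers, used here as "words".

    Assumption: k-mer tokenization is only one plausible way to turn a
    read into a sentence; the abstract does not specify the scheme.
    """
    read = read.upper()
    if len(read) < k:
        # A read shorter than k becomes a single token (or none if empty).
        return [read] if read else []
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_parallel_corpus(
    pairs: List[Tuple[str, str]], k: int = 3
) -> List[Tuple[List[str], List[str]]]:
    """Turn (uncorrected, corrected) read pairs into (source, target) sentences."""
    return [(kmer_tokens(raw, k), kmer_tokens(fixed, k)) for raw, fixed in pairs]

# Toy data: a hypothetical long-read fragment with one inserted base, and
# the version that short-read alignment would suggest as the correction.
pairs = [("ACGTTAGC", "ACGTAGC")]
corpus = build_parallel_corpus(pairs, k=3)

for source, target in corpus:
    print("source:", " ".join(source))
    print("target:", " ".join(target))
```

Each (source, target) pair would then serve as one training example for a seq2seq model, in the same way that sentence pairs are used in machine translation.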

Pages: 21
