Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data

Cited by: 0
Authors
Vintimilla, Carlos [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation, Singapore
Keywords
DNA; Task analysis; Sequential analysis; Data models; Contrastive learning; Accuracy; Representation learning; Basecalling; nanopore sequencing; self-supervised learning; wav2vec2.0
DOI
10.1109/ACCESS.2024.3440882
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have succeeded in improving basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate large amounts of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks on the basecalling task. We consider basecallers based on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that SSL-pretrained basecallers outperform their supervised counterparts in both low- and high-data regimes, with up to a 3% performance increase when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy compared with state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
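The abstract describes the approach only at the level of SSL frameworks; the repository linked above holds the authors' actual implementation. As a rough illustration of the contrastive (SimCLR-style) pretraining idea applied to raw nanopore current signals, the following PyTorch sketch pairs two augmented views of each signal chunk and minimizes an NT-Xent loss. The SignalEncoder architecture, the augment function, and every hyperparameter here are hypothetical choices for illustration, not the authors' method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SignalEncoder(nn.Module):
    # Hypothetical 1-D convolutional encoder for raw current chunks;
    # the real architecture is defined in the authors' repository.
    def __init__(self, dim=256, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=19, stride=3, padding=9),
            nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # pool over time: one vector per chunk
        )
        self.proj = nn.Linear(dim, proj_dim)    # SimCLR projection head

    def forward(self, x):                   # x: (batch, 1, signal_length)
        h = self.net(x).squeeze(-1)         # (batch, dim)
        return F.normalize(self.proj(h), dim=-1)

def augment(x):
    # Illustrative signal augmentations: additive Gaussian noise plus
    # random amplitude scaling (placeholders, not the paper's choices).
    noise = 0.05 * torch.randn_like(x)
    scale = 1.0 + 0.1 * (torch.rand(x.size(0), 1, 1, device=x.device) - 0.5)
    return scale * x + noise

def nt_xent(z1, z2, tau=0.1):
    # NT-Xent loss: each view's positive is the other view of the same chunk.
    b = z1.size(0)
    z = torch.cat([z1, z2], dim=0)          # (2b, proj_dim), already normalized
    sim = z @ z.t() / tau                   # cosine similarities as logits
    mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))   # exclude self-similarity
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)

encoder = SignalEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
signals = torch.randn(32, 1, 4096)          # stand-in for a batch of raw chunks
opt.zero_grad()
loss = nt_xent(encoder(augment(signals)), encoder(augment(signals)))
loss.backward()
opt.step()

After pretraining along these lines, the encoder would be fine-tuned with a sequence decoder (commonly CTC-based in basecallers) on the small labeled subset, which is the regime (e.g., 1% of labels) where the paper reports its largest gains.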
Pages: 109355-109366
Page count: 12