Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data

Cited by: 0
Authors
Vintimilla, Carlos [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore;
Keywords
DNA; Task analysis; Sequential analysis; Data models; Contrastive learning; Accuracy; Representation learning; Basecalling; nanopore sequencing; self-supervised learning; representation learning; wav2vec2.0;
DOI
10.1109/ACCESS.2024.3440882
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Discipline classification code
0812;
Abstract
Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have succeeded in improving basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate a large amount of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks for the basecalling task. We consider SSL basecallers based on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that the basecaller trained with self-supervision outperforms its supervised counterparts in both low- and high-data regimes, with up to a 3% increase in performance when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy compared to state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
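The pretraining idea described in the abstract, contrastive self-supervised learning on unlabeled raw nanopore signal, can be illustrated with a minimal SimCLR-style sketch in PyTorch. This is not the authors' implementation (see the linked repository for that); the SignalEncoder architecture, the noise and scaling augmentations, the window length, and all hyperparameters below are illustrative assumptions.

# Minimal SimCLR-style contrastive pretraining sketch on raw signal windows.
# NOT the paper's implementation; encoder, augmentations, and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignalEncoder(nn.Module):
    """Hypothetical 1-D CNN encoder mapping a raw signal window to an embedding."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=3, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=3, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, embed_dim)  # simplified SimCLR projection head

    def forward(self, x):                       # x: (batch, 1, window_len)
        h = self.conv(x).squeeze(-1)            # (batch, 128)
        return self.proj(h)                     # (batch, embed_dim)


def augment(x):
    """Toy augmentation: additive Gaussian noise plus random amplitude scaling."""
    noise = 0.05 * torch.randn_like(x)
    scale = 1.0 + 0.1 * torch.randn(x.size(0), 1, 1, device=x.device)
    return scale * x + noise


def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d)
    sim = z @ z.t() / temperature                              # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # drop self-similarity
    # positive of sample i is i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = SignalEncoder()
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    signals = torch.randn(32, 1, 4096)   # stand-in for unlabeled raw current windows
    z1, z2 = encoder(augment(signals)), encoder(augment(signals))
    loss = nt_xent_loss(z1, z2)
    loss.backward()
    optimizer.step()

After pretraining in this fashion, the encoder would be fine-tuned with a supervised basecalling head (e.g., a CTC decoder) on the limited labeled data, which is the setting the abstract evaluates.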
Pages: 109355-109366
Number of pages: 12
References
34 in total
  • [1] Baevski A., 2020, NeurIPS.
  • [2] Boza V., Brejova B., Vinar T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLOS ONE, 2017, 12(6).
  • [3] Caron M., 2020, Advances in Neural Information Processing Systems, Vol. 33.
  • [4] Chen T., 2020, International Conference on Machine Learning, p. 1597, DOI 10.48550/ARXIV.2002.05709.
  • [5] Chen W., Zhang P., Song L., Yang J., Han C. Simulation of Nanopore Sequencing Signals Based on BiGRU. Sensors, 2020, 20(24): 1-15.
  • [6] Das A., 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 4769, DOI 10.1109/ICASSP.2018.8461558.
  • [7] Delahaye C., Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLOS ONE, 2021, 16(10).
  • [8] Devlin J., 2019, arXiv:1810.04805.
  • [9] Ericsson L., Gouk H., Loy C. C., Hospedales T. M. Self-Supervised Representation Learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine, 2022, 39(3): 42-62.
  • [10] He K., Chen X., Xie S., Li Y., Dollar P., Girshick R. Masked Autoencoders Are Scalable Vision Learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 15979-15988.