Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data

Cited by: 0
Authors
Vintimilla, Carlos [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore;
Keywords
DNA; Task analysis; Sequential analysis; Data models; Contrastive learning; Accuracy; Representation learning; Basecalling; nanopore sequencing; self-supervised learning; representation learning; wav2vec2.0;
DOI
10.1109/ACCESS.2024.3440882
Chinese Library Classification (CLC)
TP [Automation technology; Computer technology];
Discipline classification code
0812;
Abstract
Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have succeeded in improving basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate a large amount of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks for the basecalling task. We consider SSL basecallers based on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that the basecaller trained with self-supervision outperforms its supervised counterparts in both low- and high-data regimes, with up to a 3% increase in performance when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy compared to state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
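The pretraining idea described in the abstract, contrastive self-supervised learning on unlabeled raw nanopore signal, can be illustrated with a minimal SimCLR-style sketch in PyTorch. This is not the authors' implementation (see the linked repository for that); the SignalEncoder architecture, the noise and scaling augmentations, the window length, and all hyperparameters below are illustrative assumptions.

# Minimal SimCLR-style contrastive pretraining sketch on raw signal windows.
# NOT the paper's implementation; encoder, augmentations, and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignalEncoder(nn.Module):
    """Hypothetical 1-D CNN encoder mapping a raw signal window to an embedding."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=3, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=3, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, embed_dim)  # simplified SimCLR projection head

    def forward(self, x):                       # x: (batch, 1, window_len)
        h = self.conv(x).squeeze(-1)            # (batch, 128)
        return self.proj(h)                     # (batch, embed_dim)


def augment(x):
    """Toy augmentation: additive Gaussian noise plus random amplitude scaling."""
    noise = 0.05 * torch.randn_like(x)
    scale = 1.0 + 0.1 * torch.randn(x.size(0), 1, 1, device=x.device)
    return scale * x + noise


def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d)
    sim = z @ z.t() / temperature                              # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # drop self-similarity
    # positive of sample i is i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = SignalEncoder()
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    signals = torch.randn(32, 1, 4096)   # stand-in for unlabeled raw current windows
    z1, z2 = encoder(augment(signals)), encoder(augment(signals))
    loss = nt_xent_loss(z1, z2)
    loss.backward()
    optimizer.step()

After pretraining in this fashion, the encoder would be fine-tuned with a supervised basecalling head (e.g., a CTC decoder) on the limited labeled data, which is the setting the abstract evaluates.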
Pages: 109355-109366
Number of pages: 12
References
34 in total
  • [1] Baevski A., 2020, NeurIPS.
  • [2] Boza V., Brejova B., Vinar T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLOS ONE, 2017, 12(6).
  • [3] Caron M., 2020, Advances in Neural Information Processing Systems, Vol. 33.
  • [4] Chen T., 2020, International Conference on Machine Learning, p. 1597, DOI 10.48550/ARXIV.2002.05709.
  • [5] Chen W., Zhang P., Song L., Yang J., Han C. Simulation of Nanopore Sequencing Signals Based on BiGRU. Sensors, 2020, 20(24): 1-15.
  • [6] Das A., 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 4769, DOI 10.1109/ICASSP.2018.8461558.
  • [7] Delahaye C., Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLOS ONE, 2021, 16(10).
  • [8] Devlin J., 2019, arXiv:1810.04805.
  • [9] Ericsson L., Gouk H., Loy C. C., Hospedales T. M. Self-Supervised Representation Learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine, 2022, 39(3): 42-62.
  • [10] He K., Chen X., Xie S., Li Y., Dollar P., Girshick R. Masked Autoencoders Are Scalable Vision Learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 15979-15988.