Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data

Cited by: 0
Authors
Vintimilla, Carlos [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation, Singapore
Keywords
DNA; Task analysis; Sequential analysis; Data models; Contrastive learning; Accuracy; Representation learning; Basecalling; nanopore sequencing; self-supervised learning; wav2vec2.0
DOI
10.1109/ACCESS.2024.3440882
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have succeeded in improving basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate large amounts of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks on the basecalling task. We consider basecallers based on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that SSL-pretrained basecallers outperform their supervised counterparts in both low- and high-data regimes, with up to a 3% performance increase when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy compared with state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
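The abstract describes the approach only at the level of SSL frameworks; the repository linked above holds the authors' actual implementation. As a rough illustration of the contrastive (SimCLR-style) pretraining idea applied to raw nanopore current signals, the following PyTorch sketch pairs two augmented views of each signal chunk and minimizes an NT-Xent loss. The SignalEncoder architecture, the augment function, and every hyperparameter here are hypothetical choices for illustration, not the authors' method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SignalEncoder(nn.Module):
    # Hypothetical 1-D convolutional encoder for raw current chunks;
    # the real architecture is defined in the authors' repository.
    def __init__(self, dim=256, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=19, stride=3, padding=9),
            nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # pool over time: one vector per chunk
        )
        self.proj = nn.Linear(dim, proj_dim)    # SimCLR projection head

    def forward(self, x):                   # x: (batch, 1, signal_length)
        h = self.net(x).squeeze(-1)         # (batch, dim)
        return F.normalize(self.proj(h), dim=-1)

def augment(x):
    # Illustrative signal augmentations: additive Gaussian noise plus
    # random amplitude scaling (placeholders, not the paper's choices).
    noise = 0.05 * torch.randn_like(x)
    scale = 1.0 + 0.1 * (torch.rand(x.size(0), 1, 1, device=x.device) - 0.5)
    return scale * x + noise

def nt_xent(z1, z2, tau=0.1):
    # NT-Xent loss: each view's positive is the other view of the same chunk.
    b = z1.size(0)
    z = torch.cat([z1, z2], dim=0)          # (2b, proj_dim), already normalized
    sim = z @ z.t() / tau                   # cosine similarities as logits
    mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))   # exclude self-similarity
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)

encoder = SignalEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
signals = torch.randn(32, 1, 4096)          # stand-in for a batch of raw chunks
opt.zero_grad()
loss = nt_xent(encoder(augment(signals)), encoder(augment(signals)))
loss.backward()
opt.step()

After pretraining along these lines, the encoder would be fine-tuned with a sequence decoder (commonly CTC-based in basecallers) on the small labeled subset, which is the regime (e.g., 1% of labels) where the paper reports its largest gains.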
Pages: 109355-109366
Page count: 12