A self-supervised deep learning method for data-efficient training in genomics

Cited by: 0
Authors
Hüseyin Anil Gündüz
Martin Binder
Xiao-Yin To
René Mreches
Bernd Bischl
Alice C. McHardy
Philipp C. Münch
Mina Rezaei
Affiliations
[1] Department of Statistics, LMU Munich
[2] Munich Center for Machine Learning
[3] Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research
[4] Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig
[5] German Center for Infection Research (DZIF), partner site Hannover Braunschweig
[6] Department of Biostatistics, Harvard School of Public Health
Source
Communications Biology | Volume 6
Keywords
DOI
Not available
CLC classification number
Subject classification number
Abstract
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.