MuLan-Methyl: multiple transformer-based language models for accurate DNA methylation prediction

Cited by: 0
Authors
Zeng, Wenhuan [1]
Gautam, Anupam [1, 2, 3]
Huson, Daniel H. [1, 3]
Affiliations
[1] Univ Tubingen, Inst Bioinformat & Med Informat, Algorithms Bioinformat, D-72076 Tubingen, Germany
[2] Max Planck Inst Biol Tubingen, Int Max Planck Res Sch, Mol Organisms, D-72076 Tubingen, Germany
[3] Univ Tubingen, Cluster Excellence, EXC 2124 Controlling Microbes Fight Infect, D-72076 Tubingen, Germany
Source
GIGASCIENCE | 2023 / Volume 12
Keywords
DNA methylation; natural language processing; model ensemble; model explainability; web server; N4-METHYLCYTOSINE SITES;
DOI
Not available
Chinese Library Classification
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
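The abstract states that the 5 fine-tuned models collectively predict the methylation status, but it does not say how their outputs are combined. The following is a minimal sketch of one common ensembling choice, averaging per-site probabilities and applying a threshold. The function name `ensemble_predict`, the 0.5 threshold, and the example scores are all illustrative assumptions, not details taken from the paper.

```python
from statistics import mean


def ensemble_predict(model_probs, threshold=0.5):
    """Average per-model methylation probabilities site by site.

    model_probs: one list of probabilities per model, all covering the
    same candidate sites in the same order (hypothetical input format).
    Returns (averaged probabilities, binary labels).
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_sites = len(model_probs[0])
    if any(len(p) != n_sites for p in model_probs):
        raise ValueError("all models must score the same sites")
    # zip(*...) regroups the scores so each tuple holds one site's
    # probabilities across all models.
    averaged = [mean(site_scores) for site_scores in zip(*model_probs)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Made-up scores from five hypothetical models for three candidate sites.
scores = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [0.6, 0.4, 0.7],
    [1.0, 0.0, 0.8],
]
averaged, labels = ensemble_predict(scores)
print(labels)
```

Averaging is only one option; majority voting or a learned stacking layer would fit the same description in the abstract.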
Pages: 11