MuLan-Methyl: multiple transformer-based language models for accurate DNA methylation prediction

Cited by: 0
Authors
Zeng, Wenhuan [1]
Gautam, Anupam [1, 2, 3]
Huson, Daniel H. [1, 3]
Affiliations
[1] Univ Tubingen, Inst Bioinformat & Med Informat, Algorithms Bioinformat, D-72076 Tubingen, Germany
[2] Max Planck Inst Biol Tubingen, Int Max Planck Res Sch, Mol Organisms, D-72076 Tubingen, Germany
[3] Univ Tubingen, Cluster Excellence, EXC 2124 Controlling Microbes Fight Infect, D-72076 Tubingen, Germany
Source
GIGASCIENCE | 2023 / Vol. 12
Keywords
DNA methylation; natural language processing; model ensemble; model explainability; web server; N4-METHYLCYTOSINE SITES;
DOI
Not available
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Transformer-based language models are successfully used to address a wide range of text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
Pages: 11