BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT

Cited by: 3
Authors
Wang, Shuyu [1]
Liu, Yinbo [1]
Liu, Yufeng [1]
Zhang, Yong [1]
Zhu, Xiaolei [1]
Affiliations
[1] Anhui Agricultural University, School of Science, Hefei, Anhui, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
DNA; 5-methylcytosine; BERT; Machine learning; Natural language processing; Webserver; Fine-tuning; METHYLATION; LSTM; REPRESENTATION; RESOLUTION
DOI
10.7717/peerj.16600
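Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract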
DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes and in a wide range of human diseases, so accurate detection of 5mC sites is essential. Although current sequencing technologies can map 5mC sites genome-wide, these experimental methods are costly and time-consuming. To predict 5mC sites quickly and accurately, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model, using human promoter sequences as the language corpus; BERT is a deep bidirectional language representation model based on the Transformer. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the predictor. In cross-validation, our model achieves an AUROC of 0.966, higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. On the independent test set, it also achieves an AUROC of 0.966, again higher than the other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate use of the model, we built a webserver that is freely accessible at http://5mc-pred.zhulab.org.cn.
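The abstract reports performance as AUROC. As a reminder of what that number measures, the following is a minimal Python sketch that computes AUROC from binary labels and predicted scores using the rank-sum (Mann-Whitney U) identity. The function name `auroc` and the toy labels and scores are illustrative assumptions; this is not the authors' evaluation code or data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = positive, e.g. a methylated site)
    scores: iterable of predicted scores, higher = more likely positive
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")

    # U statistic of the positives, normalised by the number of pos/neg pairs.
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example (made-up values): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
```

Read this way, an AUROC of 0.966 means that a randomly chosen positive site receives a higher score than a randomly chosen negative site about 96.6% of the time, with ties counted as half.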
Pages: 19