BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection

Cited by: 63
Authors
Le, Nguyen Quoc Khanh [1 ,2 ,3 ]
Ho, Quang-Thai [4 ,5 ]
Nguyen, Van-Nui [6 ]
Chang, Jung-Su [7 ,8 ]
Affiliations
[1] Taipei Med Univ, Coll Med, Profess Master Program Artificial Intelligence Med, Taipei 106, Taiwan
[2] Taipei Med Univ, Res Ctr Artificial Intelligence Med, Taipei 106, Taiwan
[3] Taipei Med Univ Hosp, Translat Imaging Res Ctr, Taipei 110, Taiwan
[4] Can Tho Univ, Coll Informat & Commun Technol, Can Tho, Vietnam
[5] Yuan Ze Univ, Dept Comp Sci & Engn, Chungli 32003, Taiwan
[6] Thai Nguyen Univ, Univ Informat & Commun Technol, Thai Nguyen, Vietnam
[7] Taipei Med Univ, Coll Nutr, Sch Nutr & Hlth Sci, Taipei 110, Taiwan
[8] Taipei Med Univ, Grad Inst Metab & Obes Sci, Coll Nutr, Taipei 110, Taiwan
Keywords
Promoter region; Contextualized word embedding; BERT multilingual cases; EXtreme Gradient Boosting; SHAP; Explainable artificial intelligence; IDENTIFICATION; REGIONS;
D O I
10.1016/j.compbiolchem.2022.107732
Chinese Library Classification (CLC)
Q [Biological Sciences];
Discipline codes
07; 0710; 09;
Abstract
A promoter is a DNA sequence that initiates transcription and regulates when and where genes are expressed in an organism. Because of their importance in molecular biology, identifying DNA promoters is essential for providing useful information about their functions and related diseases. Over the past decade, several computational models have been developed for the early prediction of promoters from high-throughput sequencing data. Although some useful predictors have been proposed, shortfalls remain in those models, and there is an urgent need to improve predictive performance to meet practical requirements. In this study, we propose a novel architecture that combines transformer-based natural language processing (NLP) with explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis serves as a feature-selection step that retains the top-ranked BERT encodings. In the final stage, different machine learning classifiers learn from the top features and produce the prediction outcomes. This study predicts not only DNA promoters but also their activity (strong or weak). Overall, our experiments achieved accuracies of 85.5% and 76.9% at these two levels, respectively. Our model outperformed previously published predictors on the same dataset in most evaluation metrics. We named our predictor BERT-Promoter; it is freely available at https://github.com/khanhlee/bert-promoter.
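The two-stage pipeline described above (sequence embeddings → SHAP-based feature selection → a downstream classifier) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: random vectors stand in for the 768-dimensional BERT sequence embeddings, the labels are synthetic, and instead of the SHAP library it uses the closed-form SHAP values of a linear surrogate model (for a linear model, the SHAP value of feature j on sample i is simply coef_j * (x_ij − mean_j)). All variable names and the choice of logistic regression are assumptions for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for BERT embeddings: 400 "sequences" x 64 features
# (the real pipeline would use 768-dim BERT encodings of DNA sequences).
X = rng.normal(size=(400, 64))
# Synthetic promoter / non-promoter labels driven by 3 informative features.
y = (X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] > 0).astype(int)

# Stage 1: fit a linear surrogate; for linear models the exact SHAP value
# of feature j on sample i is coef_j * (x_ij - mean_j).
surrogate = LogisticRegression(max_iter=1000).fit(X, y)
shap_values = surrogate.coef_[0] * (X - X.mean(axis=0))

# Rank features by mean absolute SHAP value and keep the top k.
k = 10
importance = np.abs(shap_values).mean(axis=0)
top_features = np.argsort(importance)[::-1][:k]

# Stage 2: train the final classifier on the selected features only.
clf = LogisticRegression(max_iter=1000).fit(X[:, top_features], y)
acc = clf.score(X[:, top_features], y)
```

On this toy data the three informative features dominate the SHAP ranking, so the second-stage classifier recovers most of the signal from only k of the original features; the paper follows the same shape but with real BERT encodings, the SHAP library, and several candidate classifiers.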
Pages: 6