BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection

Cited by: 63
Authors
Le, Nguyen Quoc Khanh [1, 2, 3]
Ho, Quang-Thai [4, 5]
Nguyen, Van-Nui [6]
Chang, Jung-Su [7, 8]
Affiliations
[1] Taipei Med Univ, Coll Med, Profess Master Program Artificial Intelligence Med, Taipei 106, Taiwan
[2] Taipei Med Univ, Res Ctr Artificial Intelligence Med, Taipei 106, Taiwan
[3] Taipei Med Univ Hosp, Translat Imaging Res Ctr, Taipei 110, Taiwan
[4] Can Tho Univ, Coll Informat & Commun Technol, Can Tho, Vietnam
[5] Yuan Ze Univ, Dept Comp Sci & Engn, Chungli 32003, Taiwan
[6] Thai Nguyen Univ, Univ Informat & Commun Technol, Thai Nguyen, Vietnam
[7] Taipei Med Univ, Coll Nutr, Sch Nutr & Hlth Sci, Taipei 110, Taiwan
[8] Taipei Med Univ, Grad Inst Metab & Obes Sci, Coll Nutr, Taipei 110, Taiwan
Keywords
Promoter region; Contextualized word embedding; BERT multilingual cases; EXtreme Gradient Boosting; SHAP; Explainable artificial intelligence; IDENTIFICATION; REGIONS;
DOI
10.1016/j.compbiolchem.2022.107732
Chinese Library Classification number
Q [Biological Sciences];
Subject classification codes
07; 0710; 09;
Abstract
A promoter is a DNA sequence that initiates transcription and regulates when and where genes are expressed in an organism. Because of this importance in molecular biology, identifying DNA promoters is a challenging task that can provide useful information about their functions and related diseases. Over the past decade, several computational models have been developed to predict promoters from high-throughput sequencing data. Although some useful predictors have been proposed, shortfalls remain in those models, and there is an urgent need to improve predictive performance to meet practical requirements. In this study, we propose a novel architecture that combines transformer-based natural language processing (NLP) with explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to identify the top-ranked BERT encodings. In the last stage, different machine learning classifiers were trained on the top features to produce the prediction outcomes. This study predicted not only DNA promoters but also their activities (strong or weak promoters). Overall, experiments showed an accuracy of 85.5% and 76.9% for these two levels, respectively. Our performance was superior to that of previously published predictors on the same dataset in most evaluation metrics. We named our predictor BERT-Promoter, and it is freely available at https://github.com/khanhlee/bert-promoter.
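The SHAP-based feature selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes random stand-in features in place of BERT encodings and a hypothetical linear scoring model in place of the classifiers used in the paper. For a linear model with features treated as independent, the SHAP value of feature i on a sample x has the closed form w_i * (x_i - mean_i), so no external library is needed. The function names and the weight values are illustrative assumptions.

```python
import random


def mean_abs_shap_linear(weights, X):
    """Mean |SHAP| per feature for a linear model f(x) = w . x + b.

    Assumes features are treated as independent, in which case the SHAP
    value of feature i on sample x is w_i * (x_i - mean_i).
    """
    n = len(X)
    d = len(weights)
    means = [sum(row[i] for row in X) / n for i in range(d)]
    return [
        sum(abs(weights[i] * (row[i] - means[i])) for row in X) / n
        for i in range(d)
    ]


def top_k_features(importance, k):
    """Return the indices of the k features with the largest importance."""
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    return sorted(order[:k])


# Stand-in for sequence encodings: 200 samples with 6 features each.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]

# Hypothetical fitted coefficients; only features 1 and 3 carry much weight.
weights = [0.0, 2.0, 0.0, -1.5, 0.1, 0.0]

importance = mean_abs_shap_linear(weights, X)
selected = top_k_features(importance, 2)

# Keep only the selected columns; these would feed the final classifier.
X_selected = [[row[i] for i in selected] for row in X]
```

In this toy setting the two heavily weighted features are the ones retained. With a tree-based classifier, the same ranking-by-mean-absolute-SHAP idea would apply, but the SHAP values would have to come from an explainer suited to that model rather than from this closed form.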
Pages: 6