iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features

Cited by: 9
Authors
Nguyen-Vo, Thanh-Hoang [1]
Trinh, Quang H. [2]
Nguyen, Loc [1]
Nguyen-Hoang, Phuong-Uyen [3]
Rahardja, Susanto [4, 5]
Nguyen, Binh P. [1]
Affiliations
[1] Victoria Univ Wellington, Sch Math & Stat, Gate 7, Wellington 6140, New Zealand
[2] Hanoi Univ Sci & Technol, Sch Informat & Commun Technol, 1 Dai Co Viet, Hanoi 100000, Vietnam
[3] Internatl Univ VNU HCMC, Linh Trung Ward, Quarter 6, Ho Chi Minh City 700000, Vietnam
[4] Northwestern Polytech Univ, Sch Marine Sci & Technol, 127 West Youyi Rd, Xian 710072, Peoples R China
[5] Singapore Inst Technol, Infocomm Technol Cluster, 10 Dover Dr, Singapore 138683, Singapore
Keywords
DNA; Transcription start site; Promoter; TATA-box; Bidirectional long short-term memory; TRANSCRIPTION START SITES; NEURAL-NETWORK; WEB SERVER; TATA BOX; IDENTIFICATION; GENE; REGIONS; PREDICTION; ALGORITHM; INITIATOR;
DOI
10.1186/s12864-022-08829-6
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification code
071005; 0836; 090102; 100705
Abstract
Background: Promoters, non-coding DNA sequences located upstream of the transcription start site of genes or gene clusters, are essential regulatory elements for the initiation and regulation of transcription. Identifying promoters in DNA sequences and genomes contributes significantly to discovering the complete structure of genes of interest, so the exploration of promoter regions is one of the most important topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model that predicts TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter Database and then refined to create four benchmark datasets.

Results: The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as the two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods, with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power than those predicting non-TATA promoters. By constructing artificial non-promoter sequences from promoter sequences, a novel idea, our models were able to learn highly specific characteristics that discriminate promoters from non-promoters, improving predictive efficiency.

Conclusions: iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source code and web server are available at https://github.com/mldlproject/2022-.iPromoter-.Seqvec.
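The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `auc_roc` and `auc_pr`, and the toy labels and scores, are assumptions made purely for illustration. AUCROC is computed as the fraction of positive/negative pairs ranked correctly, and AUCPR is approximated by average precision, one common step-wise estimate of the area under the precision-recall curve.

```python
def auc_roc(labels, scores):
    """AUCROC as the probability that a random positive (promoter)
    scores higher than a random negative; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_pr(labels, scores):
    """Average precision: mean of the precision values measured at the
    rank of each positive example. Assumes no tied scores."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


if __name__ == "__main__":
    # Hypothetical toy data: 1 = promoter, 0 = non-promoter.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print("AUCROC:", round(auc_roc(labels, scores), 4))
    print("AUCPR (average precision):", round(auc_pr(labels, scores), 4))
```

Both functions return values in [0, 1]; a classifier that ranks every promoter above every non-promoter scores 1.0 on both.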
Pages: 11