Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers

Cited by: 8
Authors
Pipoli, Vittorio [2]
Cappelli, Mattia [1]
Palladini, Alessandro [1]
Peluso, Carlo [1]
Lovino, Marta [2]
Ficarra, Elisa [2]
Affiliations
[1] Dept Control & Comp Engn, Corso Duca Abruzzi 24, I-10129 Turin, Piedmont, Italy
[2] Univ Modena & Reggio Emilia, Enzo Ferrari Engn Dept, Via P Vivarelli 10, I-41125 Modena, Emilia Romagna, Italy
Funding
EU Horizon 2020
Keywords
Attention; DNA; Gene-expression; Prediction; Transcription-factors; Transformers;
DOI
10.1016/j.cmpb.2022.107035
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background and objectives: In recent years, the prediction of gene expression levels has become crucial because of its potential clinical applications. In this context, Xpresso and other methods based on Convolutional Neural Networks and Transformers were proposed for this purpose. However, all these methods embed data with a standard one-hot encoding algorithm, resulting in highly sparse matrices. In addition, post-transcriptional regulation processes, which are of utmost importance in gene expression, are not considered in these models. Methods: This paper presents Transformer DeepLncLoc, a novel method that predicts mRNA abundance (i.e., gene expression levels) from gene promoter sequences, treating the problem as a regression task. The model uses a transformer-based architecture and introduces the DeepLncLoc method to perform the data embedding. Since DeepLncLoc is based on the word2vec algorithm, it avoids the sparse-matrix problem. Results: Post-transcriptional information related to mRNA stability and transcription factors is included in the model, leading to significantly improved performance compared with state-of-the-art works. Transformer DeepLncLoc reached an R^2 of 0.76, compared with 0.74 for Xpresso. Conclusion: The Multi-Headed Attention mechanism that characterizes the transformer methodology is suitable for modeling the interactions between DNA locations and outperforms recurrent models. Finally, integrating the transcription factor data into the pipeline leads to substantial gains in predictive power. (C) 2022 Elsevier B.V. All rights reserved.
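The contrast the abstract draws between one-hot encoding and a word2vec-style embedding can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "words" are overlapping k-mers with k = 3, and it uses a randomly initialized lookup table as a hypothetical stand-in for trained word2vec vectors. The function names, the toy sequence, and the embedding dimension are all illustrative.

```python
import random
from itertools import product

BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (one 1, three 0s)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1 (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def random_embedding_table(k=3, dim=8, seed=0):
    """Hypothetical stand-in for trained word2vec vectors: one dense vector per k-mer."""
    rng = random.Random(seed)
    return {
        "".join(p): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for p in product(BASES, repeat=k)
    }


def embed(seq, table, k=3):
    """Look up a dense vector for every k-mer in the sequence."""
    return [table[kmer] for kmer in kmer_tokens(seq, k)]


promoter = "ACGTTGCA"  # toy sequence, not real promoter data
sparse = one_hot(promoter)
dense = embed(promoter, random_embedding_table())

# Fraction of zero entries in the one-hot matrix: always 3/4 for DNA.
zero_fraction = sum(v == 0 for row in sparse for v in row) / (4 * len(promoter))
```

In this sketch the one-hot matrix is three-quarters zeros regardless of the sequence, while the k-mer lookup yields one dense real-valued vector per token. A trained word2vec model would replace the random table with vectors learned from k-mer co-occurrence.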
Pages: 9