A Comparison of Mutual Information, Linear Models and Deep Learning Networks for Protein Secondary Structure Prediction

Cited by: 3
Authors
Mahmoud, Saida Saad Mohamed [1 ,2 ]
Portelli, Beatrice [1 ]
D'Agostino, Giovanni [1 ]
Pollastri, Gianluca [3 ]
Serra, Giuseppe [1 ,4 ]
Fogolari, Federico [1 ,4 ]
Affiliations
[1] Univ Udine, Dept Math Comp Sci & Phys, Udine, Italy
[2] Cairo Univ, Fac Sci, Cairo, Egypt
[3] Univ Coll Dublin, Sch Comp Sci, Dublin, Ireland
[4] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, Udine, Italy
Keywords
Secondary structure prediction; single sequence; mutual information; linear model; deep learning; neural network; LSTM; BERT; recurrent neural networks; order
DOI
10.2174/1574893618666230417103346
Chinese Library Classification
Q5 [Biochemistry]
Subject Classification Codes
071010; 081704
Abstract
Background: Over the last several decades, predicting protein structures from amino acid sequences has been a core task in bioinformatics. The most successful methods today employ multiple sequence alignments and predict structure with excellent performance. These predictions take advantage of all the amino acids observed at a given position and their frequencies. However, the effect of a single amino acid substitution in a specific protein tends to be hidden by the alignment profile. For this reason, single-sequence-based predictions remain of interest even now that accurate multiple-alignment methods are available: using single sequences ensures that the effects of substitutions are not confounded by homologous sequences.
Objective: This work aims to understand how the single-sequence secondary structure prediction for a residue is influenced by the surrounding residues, and how different prediction methods use single-sequence information to predict the structure.
Methods: We compare mutual information, the coefficients of two linear models, and three deep learning networks. For the deep learning algorithms, we use DeepLIFT analysis to assess the effect of each residue at each position on the prediction.
Results: Mutual information and linear models quantify direct effects, whereas DeepLIFT applied to deep learning networks quantifies both direct and indirect effects.
Conclusion: Our analysis shows how different network architectures use the information in single protein sequences and highlights their differences with respect to linear models. In particular, the deep learning implementations weigh context and single-position information differently, with the best results obtained by the BERT architecture.
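As a concrete illustration of the first method compared in the abstract, here is a minimal sketch (not the authors' code) of estimating the mutual information between the amino acid observed at one window position and the secondary-structure class assigned to the central residue. The toy data and the helper function are illustrative assumptions only.

```python
# Hypothetical sketch: mutual information (in bits) between two discrete
# variables, given as (x, y) observation pairs. In the secondary-structure
# setting, x would be the residue type at a fixed offset from the center of
# a sequence window and y the structure class (H = helix, E = strand,
# C = coil) of the central residue.
from collections import Counter
import math

def mutual_information(pairs):
    """Plug-in MI estimate, in bits, from a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)              # joint counts c(x, y)
    px = Counter(x for x, _ in pairs)   # marginal counts c(x)
    py = Counter(y for _, y in pairs)   # marginal counts c(y)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with probabilities
        # written in terms of counts: (c/n) * log2( c*n / (c(x)*c(y)) ).
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy observations: (residue at offset -1, structure class of central residue).
pairs = [("A", "H"), ("A", "H"), ("G", "C"), ("G", "C"),
         ("V", "E"), ("V", "E"), ("A", "H"), ("G", "C")]
print(round(mutual_information(pairs), 3))  # → 1.561 (perfect dependence here)
```

With real data this would be computed over all (offset, structure-class) combinations in a dataset of windows, giving a direct-effect profile of how informative each surrounding position is about the central residue's structure.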
Pages: 631-646 (16 pages)