Predicting the Protein Folding Rate Based on Sequence Feature Screening and Support Vector Regression
Cited by: 0
Authors:
Li Yong
Zhou Wei
Dai Zhi-Jun
Chen Yuan
Wang Zhi-Ming
Yuan Zhe-Ming
[1]
Affiliations:
[1] Hunan Agr Univ, Hunan Prov Key Lab Crop Germplasm Innovat & Utili, Changsha 410128, Hunan, Peoples R China
Funding:
National Natural Science Foundation of China;
Keywords:
Protein folding;
Folding rate prediction;
High-dimensional feature;
Feature screening;
Support vector regression;
HIGH-DIMENSIONAL DATA;
AMINO-ACID-SEQUENCE;
FEATURE-SELECTION;
CONTACT ORDER;
ALGORITHM;
DOI:
10.3866/PKU.WHXB201404091
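Chinese Library Classification number:
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification codes: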
070304 ;
081704 ;
Abstract:
Predicting folding rates is important for clarifying the mechanism of protein folding. In this work, we collected 115 proteins with known folding rates, including two-state, multi-state, and mixed-state proteins. To characterize the primary-structure information of the proteins more comprehensively, we considered sequence length, residue composition at different scales, k-space features for residue pairs, and geostatistical association features among residue positions, with each residue replaced by its corresponding physicochemical properties. Each protein sequence was represented by a numeric vector of 9357 features. Using an improved binary matrix shuffling filter followed by multi-round worst-descriptor elimination, we retained 23 features with a clear meaning for each sample. We then built a nonlinear support vector regression (SVR) model of the folding rate on the 23 retained features. The correlation coefficient under jackknife cross-validation was 0.95. This prediction accuracy was superior to results reported in the literature and to those obtained with the reference feature selection methods. Finally, we established an interpretability system for SVR, which showed that the nonlinear regression relationship between the folding rates and the retained features was highly significant. Analysis of the effect of each retained descriptor indicated that the protein folding rate may be closely related to sequence length, to medium- and short-range features, to triplet residue composition features, and to other descriptors.
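As a rough illustration of how a jackknife (leave-one-out) correlation coefficient like the one quoted in the abstract can be computed, here is a minimal Python sketch. It is not the authors' pipeline. An ordinary least-squares line on a single feature stands in for the nonlinear SVR model to keep the example dependency-free. The function names (`fit_line`, `pearson`, `jackknife_predictions`) are hypothetical, and the data are made-up toy values rather than the 115 proteins from the paper.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit y = a + b*x (a simple stand-in for the SVR model)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def jackknife_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # leave sample i out
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a + b * xs[i])     # predict the held-out sample
    return preds

# Toy data (invented for illustration): a length-like feature vs. a folding-rate-like target.
feature = [3.6, 3.9, 4.1, 4.3, 4.6, 4.8, 5.0, 5.3]
target = [9.1, 8.0, 7.4, 6.1, 5.2, 3.9, 3.5, 1.8]

preds = jackknife_predictions(feature, target)
r = pearson(preds, target)
print(f"jackknife correlation: {r:.3f}")
```

The key point is that every prediction comes from a model that never saw the sample being predicted, so the resulting correlation estimates out-of-sample rather than in-sample performance. With a real SVR, only `fit_line` and the prediction step would change.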
Pages: 1091-1098
Number of pages: 8