DROP: an SVM domain linker predictor trained with optimal features selected by random forest

Cited by: 50
Authors
Ebina, Teppei [2]
Toh, Hiroyuki [1]
Kuroda, Yutaka [2]
Affiliations
[1] AIST Tokyo, Computat Biol Res Ctr, Koto Ku, Tokyo 1350064, Japan
[2] Tokyo Univ Agr & Technol, Dept Biotechnol & Life Sci, Koganei, Tokyo 1848588, Japan
Keywords
BOUNDARY PREDICTION; SECONDARY-STRUCTURE; PROTEINS; REGIONS;
DOI
10.1093/bioinformatics/btq700
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Motivation: Biologically important proteins are often large, multidomain proteins, which are difficult to characterize by high-throughput experimental methods. Efficient domain/boundary predictions are thus increasingly required in diverse areas of proteomics research for computationally dissecting proteins into readily analyzable domains.
Results: We constructed a support vector machine (SVM)-based domain linker predictor, DROP (Domain linker pRediction using OPtimal features), which was trained with 25 optimal features. The optimal combination of features was identified from a set of 3000 features using a random forest algorithm complemented with a stepwise feature selection. DROP demonstrated a prediction sensitivity and precision of 41.3% and 49.4%, respectively. These values were over 19.9% higher than those of control SVM predictors trained with non-optimized features, strongly suggesting the efficiency of our feature selection method. In addition, the mean NDO-Score of DROP for predicting novel domains in seven CASP8 FM multidomain proteins was 0.760, which was higher than that of any of the 12 published CASP8 DP servers. Overall, these results indicate that the SVM prediction of domain linkers can be improved by identifying optimal features that best distinguish linker from non-linker regions.
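For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how sensitivity and precision are conventionally computed from binary labels. It is not the authors' evaluation code: the function name, the toy labels, and the choice of per-residue scoring are assumptions made for illustration, and the abstract does not state whether the reported values were computed per residue or per linker.

```python
def sensitivity_and_precision(true_labels, predicted_labels):
    """Return (sensitivity, precision) for binary labels, where 1 = linker.

    Hypothetical helper: assumes per-residue labels, which the abstract
    does not confirm.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # linker, predicted linker
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # linker, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-linker, predicted linker
    # Guard against empty denominators (no true or no predicted linkers).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Toy data, invented for this example only.
    true_labels = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
    predicted_labels = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    sens, prec = sensitivity_and_precision(true_labels, predicted_labels)
    print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```

Sensitivity is the fraction of true linker positions that were recovered, and precision is the fraction of predicted linker positions that were correct, so the two can diverge when a predictor is either too permissive or too conservative.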
Pages: 487-494
Page count: 8