Predicting Gene Ontology functions based on support vector machines and statistical significance estimation

Cited by: 12
Authors
Bi, Ran [1]
Zhou, Yanhong [1]
Lu, Feng [1]
Wang, Weiqiang [1]
Affiliation
[1] Huazhong Univ Sci & Technol, Hubei Bioinformat & Mol Imaging Key Lab, Wuhan 430074, Hubei, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
protein function; Gene Ontology; support vector machines; statistical significance
DOI
10.1016/j.neucom.2006.10.006
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Gene Ontology (GO) is a common language for the functional annotation of gene products. We have developed a computational tool, GOKey, to predict the GO function of proteins based on their sequence features and the support vector machine (SVM) method. Several measures have been adopted to improve the prediction performance of GOKey, including improved handling of the problem caused by unbalanced positive and negative training data, and postprocessing strategies that evaluate the posterior probability and statistical significance of SVM outputs. GOKey has been trained to predict the 36 'molecular function' categories of the GO slims, and could easily be extended to other GO categories. The results of 5-fold cross-validation with 10,603 GO-mapped proteins demonstrate that GOKey performs better than standard SVMs. Comparisons with other computational tools for GO function prediction also show that the performance of GOKey is satisfactory. Further, GOKey has been applied to predict the GO functions of 5,381 novel human proteins in the Ensembl database. The results show that 93% of the novel proteins can be assigned one or more GO terms, and some evidence supporting the predictions has been found. GOKey can be accessed at http://infosci.hust.edu.cn. (c) 2006 Published by Elsevier B.V.
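The two postprocessing ideas named in the abstract, a posterior probability and a statistical significance for each SVM output, can be illustrated with a minimal sketch. The abstract does not say how GOKey computes either quantity, so everything below is an assumption: the sigmoid mapping follows the general Platt-style approach, the significance is a plain empirical p-value against scores of known negatives, and the function names (`fit_sigmoid`, `posterior`, `empirical_p_value`) and the toy scores are hypothetical.

```python
import math

def posterior(score, a, b):
    """Map a raw SVM decision value to a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_sigmoid(scores, labels, lr=0.1, epochs=2000):
    """Fit sigmoid parameters (a, b) by gradient descent on the log loss.

    labels use 1 for positive and 0 for negative examples.
    Assumption: plain gradient descent stands in for whatever
    optimiser the original tool used.
    """
    a, b = -1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = posterior(s, a, b)
            # With p = 1 / (1 + exp(a*s + b)), d(loss)/d(a*s + b) = y - p.
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def empirical_p_value(score, background_scores):
    """Fraction of background (known-negative) scores at least as high
    as the observed score, with a +1 correction to avoid a zero p-value."""
    at_least = sum(1 for s in background_scores if s >= score)
    return (at_least + 1) / (len(background_scores) + 1)

# Toy decision values for one GO category (hypothetical data).
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_sigmoid(scores, labels)
background = [s for s, y in zip(scores, labels) if y == 0]

new_score = 1.8
print("posterior:", round(posterior(new_score, a, b), 3))
print("p-value:", empirical_p_value(new_score, background))
```

In a sketch like this, a prediction would be reported only when the posterior is high and the p-value is low; the thresholds, like the rest of the example, would have to be chosen for the data at hand.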
Pages: 718-725
Number of pages: 8