Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset

Cited by: 71
Authors
Shi, Ming-Guang [1, 2, 3]
Xia, Jun-Feng [1, 4]
Li, Xue-Ling [1]
Huang, De-Shuang [1]
Affiliations
[1] Chinese Acad Sci, Hefei Inst Intelligent Machines, Intelligent Comp Lab, Hefei 230031, Peoples R China
[2] Univ Sci & Technol China, Dept Automat, Hefei 230026, Peoples R China
[3] Hefei Univ Technol, Sch Elect Engn & Automat, Hefei 230009, Peoples R China
[4] Univ Sci & Technol China, Sch Life Sci, Hefei 230026, Peoples R China
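Funding
National Science Foundation (United States); National High Technology Research and Development Program of China (863 Program);
Keywords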
Protein-protein interactions; Correlation coefficient; Support vector machine; Protein sequence; Gold standard positives dataset; Gold standard negatives dataset; SACCHAROMYCES-CEREVISIAE; SEMANTIC SIMILARITY; INTERACTION MAP; INTERACTION NETWORK; COMPONENT ANALYSIS; GLOBULAR-PROTEINS; AMINO-ACIDS; YEAST; SCALE; HYDROPHOBICITIES;
DOI
10.1007/s00726-009-0295-y
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Identifying protein-protein interactions (PPIs) is critical for understanding the cellular function of proteins and the machinery of a proteome. PPI data derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed that combines correlation coefficient (CC) transformation with a support vector machine (SVM). The CC transformation not only adequately accounts for the neighboring effect within a protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset, MIPS Core, and a gold standard negatives (non-interacting) dataset, GO-NEG, of the yeast Saccharomyces cerevisiae were mined to objectively evaluate the method and attenuate bias. The SVM model combined with CC transformation yielded the best performance, with an accuracy of 87.94% on the gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from smgsmg@mail.ustc.edu.cn.
Pages: 891-899
Number of pages: 9