Identification of human protein complexes from local sub-graphs of protein-protein interaction network based on random forest with topological structure features

Cited by: 10
Authors
Li, Zhan-Chao [1, 2]
Lai, Yan-Hua [1]
Chen, Li-Li [1]
Zhou, Xuan [2]
Dai, Zong [1]
Zou, Xiao-Yong [1]
Affiliations
[1] Sun Yat Sen Univ, Sch Chem & Chem Engn, Guangzhou 510275, Guangdong, Peoples R China
[2] Guangdong Pharmaceut Univ, Sch Chem & Chem Engn, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Protein complexes; Random forest; Protein-protein interaction network; Topological structures; Gene Ontology; SUPPORT VECTOR MACHINE; ATTACHMENT BASED METHOD; FUNCTIONAL MODULES; PREDICTION; CLASSIFICATION; INTEGRATION; DISCOVERY; SOFTWARE; RESOURCE;
DOI
10.1016/j.aca.2011.12.069
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
In the post-genomic era, one of the most important and challenging tasks is to identify protein complexes and to elucidate their molecular mechanisms in specific biological processes. Previous computational approaches usually identify protein complexes from a protein interaction network on the basis of dense sub-graphs and incomplete prior information. These approaches also pay little attention to the biological properties of proteins, and there is no common metric for evaluating their performance. It is therefore necessary to construct a novel method for identifying protein complexes and elucidating their function. In this study, a novel approach is proposed to identify protein complexes using random forest and topological structure. Each protein complex is represented by a graph of interactions, in which a descriptor of the protein primary structure characterizes the biological properties of each protein and each vertex is weighted by that descriptor. Topological structure features are developed and used to characterize protein complexes. The random forest algorithm is used to build a prediction model and to identify protein complexes from local sub-graphs instead of dense sub-graphs. As a demonstration, the proposed approach is applied to human protein interaction data, and satisfactory results are obtained in a 10-fold cross-validation test, with an accuracy of 80.24%, a sensitivity of 81.94%, a specificity of 80.07%, and a Matthews correlation coefficient of 0.4087. Some new protein complexes are identified, and analysis based on Gene Ontology shows that they are likely to be true complexes and to play important roles in the pathogenesis of some diseases. PCI-RFTS, a corresponding executable program for protein complex identification, can be obtained freely on request from the authors. (C) 2012 Elsevier B.V. All rights reserved.
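For readers unfamiliar with the four evaluation measures quoted in the abstract, the following is a minimal sketch of their standard definitions in terms of a binary confusion matrix. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not the authors' PCI-RFTS code.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: true complexes predicted as complex / as non-complex.
    tn/fp: non-complexes predicted as non-complex / as complex.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate

    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not data from the paper).
    for name, value in classification_metrics(tp=80, fn=20, tn=800, fp=200).items():
        print(f"{name}: {value:.4f}")
```

With imbalanced classes such as the hypothetical counts above, accuracy, sensitivity, and specificity can all be high while the Matthews correlation coefficient stays moderate, since it also accounts for the number of false positives relative to true positives.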
Pages: 32-41
Number of pages: 10