An Empirical Study of Features Fusion Techniques for Protein-Protein Interaction Prediction

Cited by: 52
Authors
Zeng, Jiancang [1]
Li, Dapeng [2]
Wu, Yunfeng [1]
Zou, Quan [3]
Liu, Xiangrong [1]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Engn, Xiamen, Peoples R China
[2] Fourth Hosp Qinhuangdao, Dept Internal Med Oncol, Qinhuangdao, Peoples R China
[3] Tianjin Univ, Sch Comp Sci & Technol, Tianjin 300354, Peoples R China
Keywords
Features fusion; features selection; Random Forests; protein-protein interaction; INTEGRATED RESOURCE; SIGNALING NETWORKS; FEATURE-SELECTION; IDENTIFICATION; INFORMATION; DATABASE; GRAM; RNA
DOI
10.2174/1574893611666151119221435
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
With the recent development of bioinformatics, the importance of understanding protein function has been widely acknowledged. Most proteins perform their functions by interacting with other proteins, so exploring protein-protein interactions (PPIs) is an urgent task. At present, predicting PPIs remains a difficult problem. Although a variety of computational methods have been proposed to identify PPIs, most of them are complex and have low accuracy. Traditional methods extract features in two steps: first, they extract features from each of the two proteins in a PPI; second, they treat the two feature vectors as strings and concatenate them. Concatenation is the result of an addition operation on strings. It increases the number of redundant features, and its result depends on the order of concatenation. For this reason, in this paper we study features fusion and features selection. The presented framework consists of three stages. In the first stage, we obtain the negative data set from an off-the-shelf database; we do not concern ourselves here with the reliability of the negative data sets used in previous studies. In the second stage, the n-gram frequency method is used to preprocess the PPI sequences. In the third stage, the final feature is spliced together, and feature selection is then applied to find the optimal feature. Finally, an effective parameter for the Random Forest classifier is selected. Experiments carried out on a real data set showed that our features fusion method outperformed traditional methods in terms of protein-protein interaction prediction. These encouraging results may be helpful for future research on protein function.
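The abstract's point about concatenation being order-dependent can be illustrated with a minimal Python sketch. It computes n-gram frequency vectors for two toy sequences and compares plain concatenation with an order-independent combination. The function names, the toy sequences, the choice of n = 2, and the element-wise sum are all assumptions made for illustration; the abstract does not specify the authors' actual fusion operator, so this is not their method.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def ngram_frequencies(sequence, n=2):
    """Return normalized n-gram frequencies over a fixed amino-acid vocabulary."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts[g] for g in vocab)
    return [counts[g] / total if total else 0.0 for g in vocab]


def concatenate(fa, fb):
    """Traditional pairing: the result depends on which protein comes first."""
    return fa + fb


def symmetric_fuse(fa, fb):
    """Hypothetical order-independent fusion (element-wise sum); an assumption,
    not the operator used in the paper."""
    return [a + b for a, b in zip(fa, fb)]


# Toy sequences, invented for this example.
protein_a = "MKTAYIAKQR"
protein_b = "GAVLIMKTAY"

fa = ngram_frequencies(protein_a)
fb = ngram_frequencies(protein_b)

order_matters = concatenate(fa, fb) != concatenate(fb, fa)
order_free = symmetric_fuse(fa, fb) == symmetric_fuse(fb, fa)
print(len(fa), len(concatenate(fa, fb)), len(symmetric_fuse(fa, fb)))
print(order_matters, order_free)
```

In this sketch the concatenated vector is twice as long as either input and changes when the two proteins are swapped, while the summed vector keeps the original length and is unchanged by the swap. Vectors of either kind could then be passed to a Random Forest classifier.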
Pages: 4-12
Page count: 9