Probability Weighted Ensemble Transfer Learning for Predicting Interactions between HIV-1 and Human Proteins

Cited by: 32
Authors
Mei, Suyu [1]
Affiliations
[1] Shenyang Normal Univ, Software Coll, Shenyang, Peoples R China
Keywords
GENE ONTOLOGY; DATA SETS; GP120; DATABASE; LOCALIZATION; COEXPRESSION; INFORMATION; ACTIVATION; MECHANISM;
DOI
10.1371/journal.pone.0079606
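Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract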
Reconstructing host-pathogen protein interaction networks is of great significance for revealing the underlying microbial pathogenesis. However, current experimentally derived networks are generally small and should be augmented by computational methods to support less biased biological inference. From a computational modelling perspective, data scarcity, data unavailability and negative data sampling are the three major problems in reconstructing host-pathogen protein interaction networks. In this work, we address these three concerns and propose a probability weighted ensemble transfer learning model for HIV-human protein interaction prediction (PWEN-TLM), in which a support vector machine (SVM) is adopted as the individual classifier of the ensemble. In the model, data scarcity and data unavailability are tackled by homolog knowledge transfer. The importance of homolog knowledge is measured by the ROC-AUC metric of the individual classifiers, whose outputs are probability weighted to yield the final decision. We further validate the assumption that homolog knowledge alone is sufficient to train a satisfactory model for host-pathogen protein interaction prediction, which makes the model more robust against data unavailability and less demanding in its data requirements. As regards negative data construction, experiments show that exclusiveness of subcellular co-localized proteins is unbiased and more reliable than random sampling. Finally, we analyse the predictions that overlap between our model and existing models, and apply the model to recognizing novel host-pathogen PPIs for further biological research.
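The abstract's AUC-based probability weighting can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so this assumes each classifier's weight is its validation ROC-AUC normalized to sum to one; the function names `roc_auc`, `auc_weights` and `weighted_ensemble` are hypothetical and not taken from the paper. The per-classifier probabilities would come from the individual SVMs, which are not modelled here.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_weights(aucs: Sequence[float]) -> List[float]:
    """Assumed weighting scheme: normalize the AUCs so the weights sum to 1."""
    total = sum(aucs)
    if total <= 0:
        raise ValueError("AUC values must sum to a positive number")
    return [a / total for a in aucs]


def weighted_ensemble(prob_matrix: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Combine per-classifier probabilities into one probability per sample.

    prob_matrix[k][i] is classifier k's probability that sample i interacts.
    """
    n_samples = len(prob_matrix[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_matrix))
        for i in range(n_samples)
    ]


def decide(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn ensemble probabilities into 0/1 decisions (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in probabilities]
```

In use, one would compute `roc_auc` for each individual classifier on held-out data, pass the resulting list to `auc_weights`, and feed the weights together with the classifiers' probability outputs on new protein pairs to `weighted_ensemble`. Classifiers trained on more informative homolog knowledge then contribute more to the final decision.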
Pages: 13
Related papers
43 in total
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]  
[Anonymous], ADV LARGE MARGIN CLA
[3]   The GOA database in 2009-an integrated Gene Ontology Annotation resource [J].
Barrell, Daniel ;
Dimmer, Emily ;
Huntley, Rachael P. ;
Binns, David ;
O'Donovan, Claire ;
Apweiler, Rolf .
NUCLEIC ACIDS RESEARCH, 2009, 37 :D396-D403
[4]   Choosing negative examples for the prediction of protein-protein interactions [J].
Ben-Hur, A ;
Noble, WS .
BMC BIOINFORMATICS, 2006, 7 (Suppl 1)
[5]   The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J].
Boeckmann, B ;
Bairoch, A ;
Apweiler, R ;
Blatter, MC ;
Estreicher, A ;
Gasteiger, E ;
Martin, MJ ;
Michoud, K ;
O'Donovan, C ;
Phan, I ;
Pilbout, S ;
Schneider, M .
NUCLEIC ACIDS RESEARCH, 2003, 31 (01) :365-370
[6]   Signaling mechanism of HIV-1 gp120 and virion-induced IL-1β release in primary human macrophages [J].
Cheung, Ricky ;
Ravyn, Vipa ;
Wang, Lingshu ;
Ptasznik, Andrzej ;
Collman, Ronald G. .
JOURNAL OF IMMUNOLOGY, 2008, 180 (10) :6675-6684
[7]  
Davis J., 2006, P 23 INT C MACH LEAR
[8]   Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression [J].
De Bodt, Stefanie ;
Proost, Sebastian ;
Vandepoele, Klaas ;
Rouze, Pierre ;
Van de Peer, Yves .
BMC GENOMICS, 2009, 10 :288
[9]   Fast SVM training algorithm with decomposition on very large data sets [J].
Dong, JX ;
Krzyzak, A ;
Suen, CY .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2005, 27 (04) :603-618
[10]  
Doolittle J, 2010, VIROL J, V7, P82