Detecting disease genes based on semi-supervised learning and protein-protein interaction networks

Cited by: 49
Authors
Thanh-Phuong Nguyen [1]
Tu-Bao Ho [2, 3]
Affiliations
[1] Microsoft Res Univ Trento Ctr Computat & Syst Bio, I-38123 Trento, Italy
[2] Japan Adv Inst Sci & Technol, Nomi, Ishikawa 9231292, Japan
[3] Vietnam Acad Sci & Technol, Hanoi, Vietnam
Keywords
Semi-supervised learning; Protein-protein interaction network; Multiple data resources integration; Disease gene neighbours; Disease-causing gene prediction; TOPOLOGICAL FEATURES; CANCER; EXPRESSION; PATTERNS; FYN;
DOI
10.1016/j.artmed.2011.09.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Objective: Predicting or prioritizing the human genes that cause disease, or "disease genes", is one of the emerging tasks in biomedical informatics. Research on the network-based approach to this problem is carried out upon the key assumption that "the network-neighbour of a disease gene is likely to cause the same or a similar disease", and mostly employs data regarding well-known disease genes, using supervised learning methods. This work aims to find an effective method to exploit the disease gene neighbourhood and the integration of several useful omics data sources, which potentially enhance disease gene predictions. Methods: We have presented a novel method to effectively predict disease genes by exploiting, in the semi-supervised learning (SSL) scheme, data regarding both disease genes and disease gene neighbours via the protein-protein interaction network. Multiple proteomic and genomic data were integrated from six biological databases, including Universal Protein Resource, Interologous Interaction Database, Reactome, Gene Ontology, Pfam, and InterDom, and a gene expression dataset. Results: By employing a 10 times stratified 10-fold cross validation, the SSL method performs better than the k-nearest neighbour method and the support vector machines method, with a sensitivity of 85%, specificity of 79%, precision of 81%, accuracy of 82%, and a balanced F-function of 83%. The other comparative experimental evaluations demonstrate the advantages of the proposed method given a small amount of labeled data, with an accuracy of 78%. We have applied the proposed method to detect 572 putative disease genes, which are biologically validated in some indirect ways. Conclusion: Semi-supervised learning improved the ability to study disease genes, especially for a specific disease, where the known disease genes (as labeled data) are very often limited. In addition to the computational improvement, the analysis of predicted disease proteins indicates that the findings are beneficial in deciphering the pathogenic mechanisms. © 2011 Elsevier B.V. All rights reserved.
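The metrics reported in the Results can be cross-checked with a few lines of arithmetic. The sketch below is not the authors' code: it assumes that the "balanced F-function" is the usual F1 score (the harmonic mean of precision and sensitivity) and, for the accuracy check only, that the evaluation set held equal numbers of positive and negative examples. The function names are hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the balanced harmonic mean (F1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy_equal_classes(sensitivity: float, specificity: float) -> float:
    """Accuracy under the ASSUMPTION of equally sized positive and negative classes."""
    return (sensitivity + specificity) / 2


# Values quoted in the abstract for the SSL method.
reported = {"sensitivity": 0.85, "specificity": 0.79, "precision": 0.81}

f1 = f_measure(reported["precision"], reported["sensitivity"])
acc = accuracy_equal_classes(reported["sensitivity"], reported["specificity"])

print(f"F1 = {f1:.2f}, accuracy (equal classes assumed) = {acc:.2f}")
```

Under these assumptions both values round to the figures quoted in the abstract (83% and 82%), which suggests the reported numbers are mutually consistent; it does not confirm how the authors actually computed them.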
Pages: 63-71
Number of pages: 9
Related papers
50 in total
  • [1] Semi-Supervised Learning of Text Classification on Bacterial Protein-Protein Interaction documents
    Xu, Guixian
    Niu, Zhendong
    Uetz, Peter
    Gao, Xu
    Qin, Xuping
    Liu, Hongfang
    2009 INTERNATIONAL JOINT CONFERENCE ON BIOINFORMATICS, SYSTEMS BIOLOGY AND INTELLIGENT COMPUTING, PROCEEDINGS, 2009, : 263 - +
  • [2] Analysis of human genes with protein-protein interaction network for detecting disease genes
    Wu, Shun-yao
    Shao, Feng-jing
    Sun, Ren-cheng
    Sui, Yi
    Wang, Ying
    Wang, Jin-long
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2014, 398 : 217 - 228
  • [3] Semi-supervised learning of the hidden vector state model for extracting protein-protein interactions
    Zhou, Deyu
    He, Yulan
    Kwoh, Chee Keong
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2007, 41 (03) : 209 - 222
  • [4] Semi-supervised learning of the hidden vector state model for protein-protein interactions extraction
    Zhou, Deyu
    He, Yulan
    Kwoh, Chee Keong
    2007 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DATA MINING, VOLS 1 AND 2, 2007, : 674 - 680
  • [5] Feature-based classification of native and non-native protein-protein interactions: Comparing supervised and semi-supervised learning approaches
    Zhao, Nan
    Pang, Bin
    Shyu, Chi-Ren
    Korkin, Dmitry
    PROTEOMICS, 2011, 11 (22) : 4321 - 4330
  • [6] An accurate classification of native and non-native protein-protein interactions using supervised and semi-supervised learning approaches
    Zhao, Nan
    Pang, Bin
    Shyu, Chi-Ren
    Korkin, Dmitry
    2010 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2010, : 185 - 189
  • [7] Protein Function Prediction Based on Active Semi-supervised Learning
    Wang Xuesong
    Cheng Yuhu
    Li Lijing
    CHINESE JOURNAL OF ELECTRONICS, 2016, 25 (04) : 595 - 600
  • [8] Semi-supervised Protein-Protein Interactions Extraction Method Based on Label Propagation and Sentence Embedding
    Tang, Zhan
    Guo, Xuchao
    Diao, Lei
    Bai, Zhao
    Wang, Longhe
    Li, Lin
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT II, 2022, 13552 : 113 - 121
  • [9] Protein Function Prediction Based on Active Semi-supervised Learning
    WANG Xuesong
    CHENG Yuhu
    LI Lijing
    Chinese Journal of Electronics, 2016, 25 (04) : 595 - 600
  • [10] Detecting Rewiring Events in Protein-Protein Interaction Networks Based on Transcriptomic Data
    Hollander, Markus
    Do, Trang
    Will, Thorsten
    Helms, Volkhard
    FRONTIERS IN BIOINFORMATICS, 2021, 1