Document classification for mining host pathogen protein-protein interactions

Cited by: 14
Authors
Yin, Lanlan [1]
Xu, Guixian [1, 2, 3]
Torii, Manabu [4]
Niu, Zhendong [2]
Maisog, Jose M. [1, 5]
Wu, Cathy
Hu, Zhangzhi [6]
Liu, Hongfang [1]
Affiliations
[1] Georgetown Univ, Dept Biostat Bioinformat & Biomath, Washington, DC USA
[2] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing, Peoples R China
[4] Georgetown Univ, Med Ctr, Imaging Sci & Informat Syst Ctr, Washington, DC 20007 USA
[5] Med Numer Inc, Germantown, MD USA
[6] Georgetown Univ, Med Ctr, Dept Oncol, Washington, DC 20007 USA
Keywords
Document classification; Host pathogen protein-protein interaction; Feature selection; Literature mining
DOI
10.1016/j.artmed.2010.04.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Objective: Scientific findings regarding human pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host pathogen protein-protein interactions (HP-PPIs), from the literature. Methods: In this paper, we report our exploration of an automated system to identify MEDLINE abstracts referring to HP-PPIs. An annotated corpus consisting of 1360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of three feature selection methods: information gain (IG), the χ² test, and specific mutual information (SI). Performance was measured using normalized discounted cumulative gain (NDCG) and positive predictive value (PPV), and all measures were obtained through 10-fold cross validation. Results: NDCG measures for classification systems using all features, or a subset of features selected using IG and the χ² test, range from 0.83 to 0.89, while classification systems built on features selected using SI had relatively lower NDCG measures. The classification system achieved a PPV of 50.7% for the top 10% ranked documents, compared to a baseline PPV of 10.0%. Conclusions: Our results indicate that document classification systems can be constructed to efficiently retrieve HP-PPI related documents. Feature selection was effective in reducing the dimensionality of features to build a compact system. © 2010 Elsevier B.V. All rights reserved.
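The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes binary relevance labels (1 = HP-PPI related, 0 = not) and the common log2-discounted form of DCG; the abstract does not give the paper's exact NDCG formulation, so the formula, the function names, and the toy scores below are all assumptions made for illustration.

```python
import math

def rank_labels(scores, labels):
    """Return the labels reordered by descending classifier score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def ppv_at_top(scores, labels, fraction=0.10):
    """PPV (precision) among the top `fraction` of ranked documents."""
    ranked = rank_labels(scores, labels)
    k = max(1, math.ceil(fraction * len(ranked)))
    return sum(ranked[:k]) / k

def dcg(ranked):
    """Discounted cumulative gain with a log2(position + 1) discount (assumed form)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ranked, start=1))

def ndcg(scores, labels):
    """DCG of the system ranking divided by the DCG of an ideal ranking."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(rank_labels(scores, labels)) / ideal if ideal > 0 else 0.0

# Hypothetical toy data: 10 documents, 2 of them relevant.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

top10_ppv = ppv_at_top(scores, labels, fraction=0.10)
top20_ppv = ppv_at_top(scores, labels, fraction=0.20)
ranking_ndcg = ndcg(scores, labels)
```

In this toy ranking the single top-ranked document is relevant, while the second relevant document sits at rank 3 rather than rank 2, so NDCG falls below 1. A baseline PPV, as in the abstract's comparison, corresponds to the overall fraction of relevant documents in the collection.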
Pages: 155-160
Number of pages: 6