An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics

Cited by: 31
Authors
Torii, Manabu [1]
Yin, Lanlan [2]
Nguyen, Thang [1]
Mazumdar, Chand T. [1]
Liu, Hongfang [2]
Hartley, David M. [1, 2, 3, 4]
Nelson, Noele P. [1, 5]
Affiliations
[1] Georgetown Univ, Med Ctr, ISIS Ctr, Washington, DC 20057 USA
[2] Georgetown Univ, Med Ctr, Dept Biostat Bioinformat & Biomath, Washington, DC 20057 USA
[3] Georgetown Univ, Med Ctr, Dept Microbiol & Immunol, Washington, DC 20057 USA
[4] Georgetown Univ, Med Ctr, Dept Radiol, Washington, DC 20057 USA
[5] Georgetown Univ, Med Ctr, Dept Pediat, Washington, DC 20057 USA
Keywords
Natural language processing; Information storage and retrieval; Medical informatics applications; Disease notification; Disease outbreaks; Biosurveillance; Internet; AGREEMENT; HEALTHMAP; MACHINE; SUPPORT
DOI
10.1016/j.ijmedinf.2010.10.015
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: Early detection of infectious disease outbreaks is crucial to protecting public health. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles together with labeled relevant articles.
Methods: Naive Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant articles and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.
Results: Daily averages of the areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836 for the naive Bayes and SVM classifiers, respectively. We referenced a database of disease outbreak reports to confirm that the evaluation data set produced by the pooling method did cover the incidents recorded in the database during the evaluation period.
Conclusions: The proposed text classification framework, which uses randomly sampled unlabeled articles, can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and articles in non-English languages.
© 2010 Elsevier Ireland Ltd. All rights reserved.
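The training and evaluation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple multinomial naive Bayes with add-one smoothing, whitespace tokenization, and a handful of invented toy articles. All function names (`train_nb`, `score`, `auc`) and all example texts are hypothetical. The key point it shows is that randomly sampled unlabeled articles stand in for the negative class, even though some of them may in fact be relevant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's features may differ.
    return text.lower().split()


def train_nb(relevant_docs, unlabeled_docs):
    """Multinomial naive Bayes where unlabeled docs are treated as negatives."""
    counts = {1: Counter(), 0: Counter()}
    for label, docs in ((1, relevant_docs), (0, unlabeled_docs)):
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set(counts[1]) | set(counts[0])
    totals = {c: sum(counts[c].values()) for c in counts}
    # Log prior odds of "relevant" versus "unlabeled".
    prior = math.log(len(relevant_docs)) - math.log(len(unlabeled_docs))
    # Per-word log-likelihood ratio with add-one (Laplace) smoothing.
    llr = {
        w: math.log((counts[1][w] + 1) / (totals[1] + len(vocab)))
        - math.log((counts[0][w] + 1) / (totals[0] + len(vocab)))
        for w in vocab
    }
    return prior, llr


def score(model, text):
    """Log-odds that an article is relevant; unseen words contribute nothing."""
    prior, llr = model
    return prior + sum(llr.get(w, 0.0) for w in tokenize(text))


def auc(scores, labels):
    """Chance that a random relevant article outranks a random irrelevant one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: labeled relevant articles plus a random unlabeled sample.
relevant = [
    "cholera outbreak reported in coastal district",
    "officials confirm avian influenza outbreak in poultry farms",
]
unlabeled = [
    "stock markets rally after earnings report",
    "local team wins championship game",
    "health officials report measles outbreak in school",  # relevant, but unlabeled
]

model = train_nb(relevant, unlabeled)

# Invented evaluation articles with analyst labels (1 = relevant, 0 = irrelevant).
evaluation = [
    ("dengue outbreak spreads in city", 1),
    ("team signs new player", 0),
    ("influenza cases confirmed by officials", 1),
    ("markets fall after report", 0),
]
eval_scores = [score(model, text) for text, _ in evaluation]
eval_labels = [label for _, label in evaluation]
toy_auc = auc(eval_scores, eval_labels)
print(f"AUC on toy evaluation set: {toy_auc:.3f}")
```

In the study, an AUC of this kind was computed for each of the 15 evaluation days and then averaged. The toy data above is far too small to say anything about real performance and only demonstrates the mechanics.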
Pages: 56-66
Number of pages: 11