Classifying disease outbreak reports using n-grams and semantic features

Cited by: 35
Authors
Conway, Mike [1]
Doan, Son [1]
Kawazoe, Ai [1]
Collier, Nigel [1]
Affiliation
[1] Res Org Informat & Syst, Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
Text classification; Feature selection; Text mining; Information extraction; Disease tracking
DOI
10.1016/j.ijmedinf.2009.03.010
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Introduction: This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general purpose semantic tagger - the USAS tagger - to generate features.

Background: We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000 document BioCaster corpus).

Feature sets: Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger.

Methodology: Three standard machine learning algorithms - Naive Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm - were used for classifying experimental data (that is, the BioCaster corpus). Feature selection was performed using the chi-squared (χ²) feature selection algorithm. Standard text classification performance metrics - Accuracy, Precision, Recall, Specificity and F-score - are reported.

Results: A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naive Bayes algorithm and feature selection yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance.

Conclusion: This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.

© 2009 Elsevier Ireland Ltd. All rights reserved.
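
The n-gram features, chi-squared feature selection and evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (ngrams, chi2_score, rank_features, metrics), the whitespace-joined feature strings and the boolean relevance labels are all assumptions made for illustration, and the chi-squared score uses the standard 2x2 contingency-table form over document frequencies.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return unigrams, bigrams and trigrams (up to n_max) as strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def chi2_score(a, b, c, d):
    """Chi-squared statistic for a 2x2 feature/class contingency table.

    a: relevant documents containing the feature
    b: non-relevant documents containing the feature
    c: relevant documents lacking the feature
    d: non-relevant documents lacking the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_features(docs, labels, n_max=3):
    """Rank n-gram features by chi-squared score, highest first.

    docs: list of token lists; labels: list of bools (True = relevant).
    Both are hypothetical inputs, not the BioCaster corpus format.
    """
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Count each feature once per document (document frequency).
        for feat in set(ngrams(tokens, n_max)):
            (pos_df if label else neg_df)[feat] += 1
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        a, b = pos_df[feat], neg_df[feat]
        scores[feat] = chi2_score(a, b, pos_total - a, neg_total - b)
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity and F-score from counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }
```

In a pipeline of this kind, only the top-ranked features from rank_features would be kept before training a classifier such as Naive Bayes; the cut-off is a tuning choice that the abstract does not specify.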
Pages: E47-E58
Number of pages: 12