Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection

Cited by: 78

Authors:

Botsis, Taxiarchis [1,2]

Nguyen, Michael D. [1]

Woo, Emily Jane [1]

Markatou, Marianthi [3,4]

Ball, Robert [1]

Affiliations:

[1] CBER, Off Biostat & Epidemiol, FDA, Rockville, MD 20852 USA

[2] Univ Tromso, Dept Comp Sci, Tromso, Norway

[3] Cornell Univ, Dept Stat Sci, New York, NY 10021 USA

[4] IBM TJ Watson Res Ctr, New York, NY USA

Source:

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION | 2011 / Vol. 18 / No. 05

Keywords:

RECORDS; MEDDRA;

DOI:

10.1136/amiajnl-2010-000022

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Objective: The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload.

Design: We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (N_pos=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets: important keywords, low-level patterns, and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations.

Measurements: Classifiers' performance was evaluated by macro-averaged recall, precision, and F-measure, and by Friedman's test; misclassification error rate analysis was also performed.

Results: The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, but at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively).

Conclusion: Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably.
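
The macro-averaged measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `macro_metrics`, the label strings, and the toy data are hypothetical, and it assumes the macro F-measure is taken as the mean of the per-class F-measures (the abstract does not say which variant was used).

```python
from typing import Dict, List


def macro_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall and F-measure.

    Each metric is computed per class and then averaged with equal
    weight per class, so a rare class (e.g. positive reports) counts
    as much as a frequent one.
    """
    classes = sorted(set(y_true))
    precisions, recalls, f_scores = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(classes)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_scores) / n,
    }


# Hypothetical toy labels, not data from the study.
toy_true = ["pos", "pos", "neg", "neg", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
toy_result = macro_metrics(toy_true, toy_pred)
```

Because every class contributes equally to the average, a classifier that ignores the minority class is penalized even when its overall accuracy is high, which matters for a corpus in which only a small fraction of reports is labeled positive.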


Pages: 631-638

Number of pages: 8

