Automating classification of free-text electronic health records for epidemiological studies

Cited by: 20

Authors:

Schuemie, Martijn J. [1]

Sen, Emine [1]

't Jong, Geert W. [1]

van Soest, Eva M. [1]

Sturkenboom, Miriam C. [1]

Kors, Jan A. [1]

Affiliations:

[1] Erasmus MC, Dept Med Informat, NL-3000 CA Rotterdam, Netherlands

Source:

PHARMACOEPIDEMIOLOGY AND DRUG SAFETY | 2012 / Volume 21 / Issue 06

Keywords:

free text; text mining; case definition; machine learning; method; QUALITY

DOI:

10.1002/pds.3205

Chinese Library Classification (CLC) number:

R1 [Preventive medicine, hygiene]

Subject classification codes:

1004; 120402

Abstract:

Purpose: Increasingly, patient information is stored in electronic medical records, which could be reused for research. Often these records comprise unstructured narrative data, which are cumbersome to analyze. The authors investigated whether text mining can make these data suitable for epidemiological studies and compared a concept recognition approach with a range of machine learning techniques that require a manually annotated training set. The authors show how this training set can be created with minimal effort by using a broad database query.

Methods: The approaches were tested on two data sets: a publicly available set of English radiology reports, to which International Classification of Diseases, Ninth Revision, Clinical Modification codes needed to be assigned, and a set of Dutch GP records that needed to be classified as either liver disorder cases or noncases. Performance was tested against a manually created gold standard.

Results: The best overall performance was achieved by combining a manually created filter for removing negations and speculations with rule learning algorithms such as RIPPER, with high scores on both the radiology reports (positive predictive value = 0.88, sensitivity = 0.85, specificity = 1.00) and the GP records (positive predictive value = 0.89, sensitivity = 0.91, specificity = 0.76).

Conclusions: Although a training set still needs to be created manually, text mining can help reduce the amount of manual work needed to incorporate narrative data in an epidemiological study and will make the data extraction more reproducible. An advantage of machine learning is that it is able to pick up specific language use, such as abbreviations and synonyms used by physicians. Copyright (C) 2012 John Wiley & Sons, Ltd.
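The two ideas named in the Results, a filter that removes negated or speculative statements and the reported evaluation measures, can be illustrated with a minimal Python sketch. The cue words, the sentence splitter, and the function names below are hypothetical and chosen only for illustration; the abstract does not describe the authors' actual filter, and the counts in the usage note are toy values, not results from the study.

```python
import math
import re

# Hypothetical cue lists (assumption): the abstract does not list the
# cues used in the authors' manually created filter.
NEGATION_CUES = ("no", "not", "without", "denies")
SPECULATION_CUES = ("possible", "suspected", "rule out")

_CUE_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(cue) for cue in NEGATION_CUES + SPECULATION_CUES)
    + r")\b",
    re.IGNORECASE,
)


def remove_negated_sentences(text: str) -> str:
    """Drop every sentence that contains a negation or speculation cue.

    Sentences are split naively on whitespace following '.', '!' or '?'.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if s and not _CUE_PATTERN.search(s)]
    return " ".join(kept)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the three measures reported in the abstract from
    confusion-matrix counts against a gold standard."""
    return {
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
    }
```

For example, `remove_negated_sentences("Liver enlarged. No signs of hepatitis. Possible cirrhosis.")` keeps only the first sentence, and `classification_metrics(tp=8, fp=2, tn=18, fn=2)` gives a positive predictive value and sensitivity of 0.8 and a specificity of 0.9 for those toy counts. Dropping whole sentences is a deliberately coarse choice; a scope-aware filter would remove only the negated phrase.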

Pages: 651-658

Number of pages: 8
