Pneumonia identification using statistical feature selection

Cited by: 40
Authors
Bejan, Cosmin Adrian [1]
Xia, Fei [1, 2]
Vanderwende, Lucy [1, 3]
Wurfel, Mark M. [4]
Yetisgen-Yildiz, Meliha [1, 2]
Affiliations
[1] Univ Washington, Sch Med, Dept Biomed & Hlth Informat, Seattle, WA 98195 USA
[2] Univ Washington, Dept Linguist, Seattle, WA 98195 USA
[3] Microsoft Res, Redmond, WA USA
[4] Univ Washington, Sch Med, Div Pulm & Crit Care Med, Seattle, WA 98195 USA
Keywords
X-RAY REPORTS; ALGORITHM; DISEASES; TEXT;
DOI
10.1136/amiajnl-2011-000752
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Objective: This paper describes a natural language processing system for the task of pneumonia identification. Based on the information extracted from the narrative reports associated with a patient, the task is to identify whether or not the patient is positive for pneumonia.
Design: A binary classifier was employed to identify pneumonia from a dataset of multiple types of clinical notes created for 426 patients during their stay in the intensive care unit. For this purpose, three types of features were considered: (1) word n-grams, (2) Unified Medical Language System (UMLS) concepts, and (3) assertion values associated with pneumonia expressions. System performance was greatly increased by a feature selection approach which uses statistical significance testing to rank features based on their association with the two categories of pneumonia identification.
Results: Besides testing our system on the entire cohort of 426 patients (unrestricted dataset), we also used a smaller subset of 236 patients (restricted dataset). The performance of the system was compared with the results of a baseline previously proposed for these two datasets. The best results achieved by the system (85.71 and 81.67 F1-measure) are significantly better than the baseline results (50.70 and 49.10 F1-measure) on the restricted and unrestricted datasets, respectively.
Conclusion: Using a statistical feature selection approach that allows the feature extractor to consider only the most informative features from the feature space significantly improves the performance over a baseline that uses all the features from the same feature space. Extracting the assertion value for pneumonia expressions further improves the system performance.
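
The feature selection step described in the abstract, ranking features by how strongly they are associated with the positive and negative pneumonia categories, could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the significance test, so a Pearson chi-square statistic on a 2x2 contingency table is assumed here, and the function names (`chi_square_2x2`, `rank_features`) and the toy features are hypothetical, not taken from the paper.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: positive patients with the feature, b: negative patients with it,
    c: positive patients without it,       d: negative patients without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the feature carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_features(docs, labels):
    """Rank binary features by their chi-square association with the label.

    docs:   list of sets of feature strings, one set per patient
    labels: list of bools, True for a pneumonia-positive patient
    Returns (feature, score) pairs, highest score first (ties by name).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for feats, is_pos in zip(docs, labels):
        (pos_counts if is_pos else neg_counts).update(feats)

    scores = {}
    for feat in set(pos_counts) | set(neg_counts):
        a, b = pos_counts[feat], neg_counts[feat]
        scores[feat] = chi_square_2x2(a, b, n_pos - a, n_neg - b)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy example (hypothetical unigram features for four patients).
docs = [
    {"infiltrate", "fever"},
    {"infiltrate", "cough"},
    {"clear", "cough"},
    {"clear", "fever"},
]
labels = [True, True, False, False]
for feature, score in rank_features(docs, labels):
    print(f"{feature}\t{score:.2f}")
```

A classifier would then be trained on only the top-ranked features; how many to keep (a count or a significance threshold) is a tuning choice that the abstract does not specify.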
Pages: 817-823
Page count: 7
Related papers
28 in total
[1]  
[Anonymous], 1997, ICML
[2]  
[Anonymous], 2007, BIOL TRANSLATIONAL C
[3]  
Aronsky D, 2001, J AM MED INFORM ASSN, P12
[4]  
Aronson AR, 2001, J AM MED INFORM ASSN, P17
[5]   A simple algorithm for identifying negated findings and diseases in discharge summaries [J].
Chapman, WW ;
Bridewell, W ;
Hanbury, P ;
Cooper, GF ;
Buchanan, BG .
JOURNAL OF BIOMEDICAL INFORMATICS, 2001, 34 (05) :301-310
[6]  
Chapman WW, 1999, J AM MED INFORM ASSN, P216
[7]   A comparison of classification algorithms to automatically identify chest X-ray reports that support pneumonia [J].
Chapman, WW ;
Fiszman, M ;
Chapman, BE ;
Haug, PJ .
JOURNAL OF BIOMEDICAL INFORMATICS, 2001, 34 (01) :4-14
[8]   What can natural language processing do for clinical decision support? [J].
Demner-Fushman, Dina ;
Chapman, Wendy W. ;
McDonald, Clement J. .
JOURNAL OF BIOMEDICAL INFORMATICS, 2009, 42 (05) :760-772
[9]  
Elkin Peter L, 2008, AMIA Annu Symp Proc, P172
[10]  
Fan RE, 2008, J MACH LEARN RES, V9, P1871