Measuring the effect of different types of unsupervised word representations on Medical Named Entity Recognition

Cited by: 10
Authors
Casillas, Arantza [1]
Ezeiza, Nerea [1]
Goenaga, Iakes [1]
Perez, Alicia [1]
Soto, Xabier [1]
Affiliations
[1] Univ Basque Country UPV EHU, IXA Grp, Manuel Lardizabal 1, Donostia San Sebastian 20080, Spain
Keywords
Electronic Health Records; Medical Named Entity Recognition; Health Information Systems; Neural network; Normalization; Spanish
DOI
10.1016/j.ijmedinf.2019.05.022
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Background: This work applies Natural Language Processing to the clinical domain, specifically Medical Entity Recognition (MER) on Electronic Health Records (EHRs). Until Deep Neural Networks (DNNs) emerged, developing a MER system entailed heavy data preprocessing and feature engineering. However, the quality of the word representations, in the form of embedding layers, is still an important issue for DNN inference. Goal: The main goal of this work is to develop a robust MER system by adapting general-purpose DNNs to cope with the high lexical variability shown in EHRs. In addition, given that EHRs tend to be scarce while out-of-domain corpora are available, the aim is to assess the impact of the word representations on MER performance as we move to other domains. To this end, exhaustive experimentation varying the methods used to generate the representations and the network parameters is crucial. Methods: We adapted a general-purpose sequential tagger based on Bidirectional Long Short-Term Memory (BiLSTM) cells and Conditional Random Fields (CRFs) to make it tolerant to high lexical variability and a limited amount of corpus data. To do so, we added part-of-speech (POS) and semantic-tag embedding layers to the word representations. Results: One of the strengths of this work is the exhaustive evaluation of dense word representations, obtained by varying not only the domain and genre but also the learning algorithms and their parameter settings. With the proposed method, we attained an error reduction of 1.71 (5.7%) compared to the state of the art, even though no preprocessing or feature engineering was used. Conclusions: Our results indicate that dense representations built taking word order into account benefit the entity extraction system. In addition, we found that using a medical corpus (not necessarily EHRs) to infer the representations improves performance, even if it does not belong to the same genre.
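The Methods sentence about adding POS and semantic-tag embedding layers to the word representations can be illustrated with a minimal sketch. It assumes the three embeddings are simply concatenated per token before reaching the BiLSTM-CRF tagger; the abstract does not state how they are combined. All names, dimensions, tag labels, and the toy Spanish tokens below are hypothetical, and the tagger itself is omitted.

```python
import random


def make_table(vocab, dim, seed):
    """Build a toy embedding table: one random vector of length `dim` per symbol."""
    rng = random.Random(seed)
    table = {sym: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for sym in vocab}
    table["<unk>"] = [0.0] * dim  # fallback vector for unseen symbols
    return table


def lookup(table, symbol):
    """Return the vector for `symbol`, or the <unk> vector if it is missing."""
    return table.get(symbol, table["<unk>"])


def concat_features(words, pos_tags, sem_tags, word_tab, pos_tab, sem_tab):
    """Concatenate word, POS, and semantic-tag vectors for each token.

    Assumption: plain concatenation; the source does not specify the
    combination method.
    """
    if not (len(words) == len(pos_tags) == len(sem_tags)):
        raise ValueError("words, pos_tags and sem_tags must have equal length")
    return [
        lookup(word_tab, w) + lookup(pos_tab, p) + lookup(sem_tab, s)
        for w, p, s in zip(words, pos_tags, sem_tags)
    ]


# Hypothetical toy sentence and tag inventories (not taken from the paper).
words = ["dolor", "abdominal", "agudo"]
pos_tags = ["NOUN", "ADJ", "ADJ"]
sem_tags = ["SYMPTOM", "BODY_PART", "O"]

WORD_DIM, POS_DIM, SEM_DIM = 8, 3, 2  # illustrative sizes only
word_tab = make_table(["dolor", "abdominal"], WORD_DIM, seed=0)  # "agudo" is unseen
pos_tab = make_table(["NOUN", "ADJ"], POS_DIM, seed=1)
sem_tab = make_table(["SYMPTOM", "BODY_PART", "O"], SEM_DIM, seed=2)

features = concat_features(words, pos_tags, sem_tags, word_tab, pos_tab, sem_tab)
print(len(features), len(features[0]))
```

In this sketch an out-of-vocabulary word such as "agudo" still receives informative POS and semantic-tag components, which is one plausible way such layers could help with lexical variability.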
Pages: 100-106
Number of pages: 7