Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives

Cited by: 17
Authors
Akhtyamova, Liliya [1]
Martinez, Paloma [2]
Verspoor, Karin [3, 4]
Cardiff, John [1]
Affiliations
[1] Technol Univ Dublin, Dept Comp, Tallaght Campus, Dublin D06 F793, Ireland
[2] Carlos III Univ Madrid, Comp Sci Dept, Madrid 28903, Spain
[3] Univ Melbourne, Sch Comp & Informat Syst, Melbourne, Vic 3010, Australia
[4] Univ Melbourne, Med Sch, Melbourne, Vic 3010, Australia
Keywords
Task analysis; Natural language processing; Artificial neural networks; Biological system modeling; Drugs; Testing; Data mining; Clinical case narratives; Contextualized word embeddings; Deep learning; Language representations; Named entity recognition; Spanish language
DOI
10.1109/ACCESS.2020.3018688
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the Big Data era, there is an increasing need to fully exploit and analyze the huge quantity of health information available. Natural Language Processing (NLP) technologies can contribute by extracting relevant information from unstructured data contained in Electronic Health Records (EHR), such as clinical notes, patients' discharge summaries, and radiology reports. The extracted information can help in health-related decision-making processes. The Named Entity Recognition (NER) task, which detects important concepts in texts (e.g., diseases, symptoms, drugs), is crucial in the information extraction process yet has received little attention in languages other than English. In this work, we develop a deep learning-based NLP pipeline for biomedical entity extraction in Spanish clinical narratives. We explore the use of contextualized word embeddings, which incorporate context variation into word representations, to enhance named entity recognition in Spanish-language clinical text, particularly of pharmacological substances, compounds, and proteins. Various combinations of word and sense embeddings were tested on the evaluation corpus of the PharmacoNER 2019 task, the Spanish Clinical Case Corpus (SPACCC). This data set consists of clinical case sections extracted from open-access Spanish-language medical publications. Our study shows that our deep learning-based system, which couples domain-specific contextualized embeddings with stacking of complementary embeddings, yields superior performance over a system with integrated standard and general-domain word embeddings. With this system, we achieve performance competitive with the state of the art.
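The "stacking of complementary embeddings" mentioned in the abstract is commonly realized by concatenating, for each token, the vectors produced by several embedding sources. The following is a minimal sketch of that idea only, not the authors' pipeline: the lookup table `WORD_VECS`, the fallback `UNK_WORD`, and the functions `contextual_vec` and `stack_embeddings` are hypothetical stand-ins invented for illustration, and the "contextual" vector here is a toy function of neighbouring tokens rather than a trained language model.

```python
from typing import Dict, List

# Hypothetical static word vectors (assumption: a real system would load trained embeddings).
WORD_VECS: Dict[str, List[float]] = {
    "paracetamol": [0.1, 0.2],
    "oral": [0.3, 0.4],
}
UNK_WORD: List[float] = [0.0, 0.0]  # fallback for out-of-vocabulary tokens


def contextual_vec(tokens: List[str], i: int) -> List[float]:
    """Toy stand-in for a contextualized embedding.

    The vector depends on the neighbouring tokens, so the same word
    receives different vectors in different sentences.
    """
    left = len(tokens[i - 1]) if i > 0 else 0
    right = len(tokens[i + 1]) if i + 1 < len(tokens) else 0
    return [float(left), float(right), float(len(tokens[i]))]


def stack_embeddings(tokens: List[str]) -> List[List[float]]:
    """Concatenate the static and contextual vectors for every token."""
    stacked = []
    for i, tok in enumerate(tokens):
        static = WORD_VECS.get(tok.lower(), UNK_WORD)
        stacked.append(static + contextual_vec(tokens, i))
    return stacked


vectors = stack_embeddings(["Paracetamol", "oral"])
```

Each stacked vector has the combined dimensionality of its sources (here 2 + 3 = 5). In a full NER system, these per-token vectors would typically be passed to a sequence tagger.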
Pages: 164717-164726
Page count: 10