A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine (vol 21, 69, 2021)

Cited by: 2
Authors
Campillos-Llanos, Leonardo [1]
Valverde-Mateos, Ana [2]
Capllonch-Carrion, Adrian [3]
Moreno-Sandoval, Antonio [1]
Affiliations
[1] Univ Autonoma Madrid, Computat Linguist Lab, C Francisco Tomas y Valiente 1 Cantoblanco Campus, Madrid 28049, Spain
[2] Spanish Royal Acad Med, Med Terminol Unit, C Arrieta 12, Madrid 28013, Spain
[3] Complejo Asistencial Hosp Benito Menni, C Jardines 1, Madrid 28350, Spain
Funding
EU Horizon 2020;
Keywords
Clinical Trials; Evidence-Based Medicine; Inter-Annotator Agreement; Natural Language Processing; Semantic Annotation;
DOI
10.1186/s12911-021-01475-0
Chinese Library Classification (CLC) number
R-058;
Discipline classification code
Abstract
Background: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing improves access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus.
Methods: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models.
Results: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities (13.98% are nested entities). Regarding IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±0.99) to 86.74% (±0.19) of average F-measure.
Conclusions: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html. The methods are generalizable to other languages with similar available sources. © 2021, The Author(s).
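The abstract reports IAA as an F-measure under strict and relaxed matching but does not spell out the computation. The following is a minimal sketch of one common way to do it, not the authors' implementation. It assumes that strict match means identical character offsets and label, that relaxed match means overlapping offsets with the same label, and that one annotator is treated as the reference. The `Entity` type, the `f_measure` function, and the sample offsets are all hypothetical.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """Hypothetical annotation: character span plus semantic group."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "ANAT", "CHEM", "DISO", "PROC"


def _overlaps(a: Entity, b: Entity) -> bool:
    # Two spans overlap if each starts before the other ends.
    return a.start < b.end and b.start < a.end


def f_measure(ann_a: List[Entity], ann_b: List[Entity], relaxed: bool = False) -> float:
    """F-measure between two annotators, treating ann_a as the reference.

    Assumption: strict = same offsets and label; relaxed = same label
    and any character overlap.
    """
    if not ann_a or not ann_b:
        return 0.0
    if relaxed:
        def match(x: Entity, y: Entity) -> bool:
            return x.label == y.label and _overlaps(x, y)
    else:
        def match(x: Entity, y: Entity) -> bool:
            return x == y
    # Precision: share of annotator B's entities that match something in A.
    precision = sum(any(match(a, b) for a in ann_a) for b in ann_b) / len(ann_b)
    # Recall: share of annotator A's entities that match something in B.
    recall = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations of the same text by two annotators.
annotator_1 = [Entity(0, 9, "CHEM"), Entity(15, 23, "DISO"), Entity(30, 36, "PROC")]
annotator_2 = [Entity(0, 9, "CHEM"), Entity(12, 23, "DISO"), Entity(40, 45, "ANAT")]

strict_f = f_measure(annotator_1, annotator_2)
relaxed_f = f_measure(annotator_1, annotator_2, relaxed=True)
print(f"strict: {strict_f:.4f}, relaxed: {relaxed_f:.4f}")
```

In this toy case only the CHEM entity matches exactly, while the DISO entity differs in its start offset and counts only under relaxed matching, which is why the relaxed score is higher, in line with the direction of the gap reported in the abstract.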
Pages: 1
Related papers
1 in total
[1] Campillos-Llanos L, 2021, BMC MED INFORM DECIS, V21, DOI 10.1186/s12911-021-01395-z