Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study

Cited by: 115
Authors
Li, Fei [1,2,3]
Jin, Yonghao [1]
Liu, Weisong [1,2,3]
Rawat, Bhanu Pratap Singh [4]
Cai, Pengshan [4]
Yu, Hong [1,2,3,4]
Affiliations
[1] Univ Massachusetts, Dept Comp Sci, 1 Univ Ave, Lowell, MA 01854 USA
[2] Bedford Vet Affairs Med Ctr, Ctr Healthcare Org & Implementat Res, Bedford, MA USA
[3] Univ Massachusetts, Sch Med, Dept Med, Worcester, MA USA
[4] Univ Massachusetts, Sch Comp Sci, Amherst, MA 01003 USA
Funding
US National Institutes of Health
Keywords
natural language processing; entity normalization; deep learning; electronic health record note; BERT; NAMED ENTITY RECOGNITION; NORMALIZATION;
DOI
10.2196/14830
Chinese Library Classification
R-058
Abstract
Background: The bidirectional encoder representations from transformers (BERT) model has achieved great success in many natural language processing (NLP) tasks, such as named entity recognition and question answering. However, little prior work has explored using this model for an important task in the biomedical and clinical domains: entity normalization.

Objective: We aim to investigate the effectiveness of BERT-based models for biomedical and clinical entity normalization, and to determine whether, and to what degree, the domain of the training data influences the performance of BERT-based models.

Methods: Our data comprised 1.5 million unlabeled electronic health record (EHR) notes. We first fine-tuned BioBERT on this large collection of unlabeled EHR notes, yielding EhrBERT, a BERT-based model trained on 1.5 million EHR notes. We then further fine-tuned EhrBERT, BioBERT, and BERT on three annotated corpora for biomedical and clinical entity normalization: the Medication, Indication, and Adverse Drug Events (MADE) 1.0 corpus, the National Center for Biotechnology Information (NCBI) disease corpus, and the Chemical-Disease Relations (CDR) corpus. We compared our models with two state-of-the-art normalization systems, MetaMap and disease name normalization (DNorm).

Results: EhrBERT achieved 40.95% F1 on the MADE 1.0 corpus for mapping named entities to the Medical Dictionary for Regulatory Activities (MedDRA) and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT), which together contain about 380,000 terms. On this corpus, EhrBERT outperformed MetaMap by 2.36% in F1. On the NCBI disease corpus and the CDR corpus, EhrBERT also outperformed DNorm, improving F1 from 88.37% and 89.92% to 90.35% and 93.82%, respectively. EhrBERT likewise outperformed BioBERT and BERT on the MADE 1.0 and CDR corpora.

Conclusions: Our work shows that BERT-based models achieve state-of-the-art performance for biomedical and clinical entity normalization, and that they can be readily fine-tuned to normalize any kind of named entity.
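To make the task concrete: the paper treats entity normalization as mapping a mention string (e.g., "high blood pressure") to a concept identifier in a fixed terminology. Below is a minimal, hypothetical Python sketch of that formulation using the Hugging Face transformers library, with a BERT-style encoder and a classification head over a toy concept vocabulary. The checkpoint name, mention strings, and three-concept label space are illustrative assumptions, not the authors' released code; EhrBERT itself is not assumed to be publicly available.

```python
# Hypothetical sketch (not the authors' code): entity normalization as
# classification of a mention string into a fixed concept vocabulary.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint for EhrBERT, which is assumed unavailable here.
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
# Toy concept IDs (illustrative UMLS-style CUIs), one per class.
CONCEPTS = ["C0011849", "C0020538", "C0027051"]

class MentionDataset(Dataset):
    """Pairs each entity mention with the index of its gold concept."""
    def __init__(self, mentions, labels, tokenizer):
        self.enc = tokenizer(mentions, truncation=True, padding=True,
                             max_length=32, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Classification head over the concept vocabulary, initialized fresh.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(CONCEPTS))

train = MentionDataset(
    ["diabetes mellitus", "high blood pressure", "heart attack"],
    [0, 1, 2], tokenizer)
loader = DataLoader(train, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        out = model(**batch)  # cross-entropy loss over concept classes
        out.loss.backward()
        optimizer.step()

# Inference: pick the highest-scoring concept for a new mention.
model.eval()
with torch.no_grad():
    enc = tokenizer("hypertension", return_tensors="pt")
    pred = model(**enc).logits.argmax(dim=-1).item()
print(CONCEPTS[pred])
```

In the actual MADE 1.0 setting the label space would span MedDRA and SNOMED-CT, roughly 380,000 concepts rather than three, which is what makes the task substantially harder than this toy setup suggests.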
Pages: 13