Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT

Cited by: 2
Authors
Wada, Shoya [1 ]
Takeda, Toshihiro [1 ]
Okada, Katsuki [1 ]
Manabe, Shirou [1 ]
Konishi, Shozo [1 ]
Kamohara, Jun [2 ]
Matsumura, Yasushi [1 ]
Affiliations
[1] Osaka Univ, Grad Sch Med, Dept Med Informat, 2-2 Yamadaoka, Suita, Osaka 5650871, Japan
[2] Osaka Univ, Fac Med, Suita, Japan
Keywords
Natural language processing; Deep learning; BERT
DOI
10.1016/j.artmed.2024.102889
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Background: Pretraining large-scale neural language models on raw text has contributed significantly to improving transfer learning in natural language processing. With the introduction of transformer-based language models such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has improved markedly in both the general and medical domains. However, it is difficult to train domain-specific BERT models to perform well in domains for which few large, high-quality databases are publicly available.

Objective: We hypothesized that this problem could be addressed by oversampling a domain-specific corpus and using it for pretraining together with a larger corpus in a balanced manner. In the present study, we verified our hypothesis by developing pretraining models using our method and evaluating their performance.

Methods: Our proposed method pretrains a model simultaneously on knowledge from distinct domains after oversampling. We conducted three experiments in which we generated (1) an English biomedical BERT from a small biomedical corpus, (2) a Japanese medical BERT from a small medical corpus, and (3) an enhanced biomedical BERT pretrained with complete PubMed abstracts in a balanced manner. We then compared their performance with that of conventional models.

Results: Our English BERT pretrained on both a general-domain corpus and a small medical-domain corpus performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our proposed method was more effective than the conventional methods for each biomedical corpus of the same size in the general domain. Our Japanese medical BERT outperformed the other BERT models built using a conventional method on almost all the medical tasks, showing the same trend as the first (English) experiment. Further, our enhanced biomedical BERT model, which was not pretrained on clinical notes, achieved superior clinical and biomedical scores on the BLUE benchmark, with increases of 0.3 points in the clinical score and 0.5 points in the biomedical score over models trained without our proposed method.

Conclusions: Well-balanced pretraining using oversampled instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.
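The corpus-balancing idea described in the Methods can be illustrated with a short sketch. The Python snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it oversamples (duplicates) a small domain-specific corpus until it roughly matches the size of a large general-domain corpus, then shuffles the two together into a single pretraining corpus. The file names and the simple duplication strategy are hypothetical choices for illustration only.

```python
# Minimal sketch of balancing a small domain corpus against a large general
# corpus by oversampling, as a preprocessing step before BERT-style pretraining.
# File names and the duplication scheme are illustrative assumptions.
import math
import random


def load_sentences(path: str) -> list[str]:
    """Read one sentence per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def oversample_to_match(small: list[str], target_size: int) -> list[str]:
    """Repeat the small corpus until it has roughly target_size sentences."""
    repeats = math.ceil(target_size / len(small))
    return (small * repeats)[:target_size]


def build_balanced_corpus(general_path: str, domain_path: str, out_path: str,
                          seed: int = 42) -> None:
    general = load_sentences(general_path)   # e.g. general-domain sentences
    domain = load_sentences(domain_path)     # e.g. small medical corpus
    balanced_domain = oversample_to_match(domain, len(general))

    mixed = general + balanced_domain
    random.Random(seed).shuffle(mixed)       # interleave the two domains

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(mixed) + "\n")


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    build_balanced_corpus("general_corpus.txt", "medical_corpus.txt",
                          "balanced_pretraining_corpus.txt")
```

The resulting file can then be fed to a standard masked-language-model pretraining pipeline; the key point is that both domains contribute a comparable number of training instances despite their very different raw sizes.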
Pages: 12