Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

Cited by: 0
Authors
Vakili, Thomas [1 ]
Lamproudis, Anastasios [1 ]
Henriksson, Aron [1 ]
Dalianis, Hercules [1 ]
Affiliations
[1] Stockholm Univ, Dept Comp & Syst Sci DSV, Kista, Sweden
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Privacy-preserving machine learning; pseudonymization; de-identification; Swedish clinical text; pre-trained language models; BERT; downstream tasks; NER; multi-label classification; domain adaptation
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
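The abstract contrasts two automatic de-identification strategies: discarding every sentence that contains detected personally identifiable information (PII), and pseudonymization, i.e. replacing detected PII spans with realistic surrogate values. The minimal Python sketch below illustrates both ideas; the NER output format, example sentences, and surrogate lists are invented for demonstration and do not reflect the authors' actual pipeline or data.

# Hypothetical illustration of the two de-identification strategies described
# in the abstract; not the authors' actual system.
import random

# Invented surrogate pools per PII label.
SURROGATES = {
    "FIRST_NAME": ["Karin", "Erik", "Maria"],
    "DATE": ["2015-03-12", "2016-07-01"],
}

def remove_sentences(sentences, detections):
    """Strategy (a): keep only sentences with no detected PII."""
    flagged = {sent_idx for sent_idx, _, _, _ in detections}
    return [s for i, s in enumerate(sentences) if i not in flagged]

def pseudonymize(sentences, detections):
    """Strategy (b): replace each detected PII span with a random surrogate."""
    result = list(sentences)
    # Apply spans from right to left so earlier character offsets stay valid.
    for sent_idx, start, end, label in sorted(detections, key=lambda d: (d[0], -d[1])):
        surrogate = random.choice(SURROGATES.get(label, ["[REMOVED]"]))
        s = result[sent_idx]
        result[sent_idx] = s[:start] + surrogate + s[end:]
    return result

if __name__ == "__main__":
    # Toy clinical note split into sentences, with PII detections given as
    # (sentence_index, start_offset, end_offset, label) tuples.
    sents = [
        "Patient Anna admitted on 2014-05-02.",  # contains a name and a date
        "Blood pressure stable, no fever.",      # contains no PII
    ]
    dets = [(0, 8, 12, "FIRST_NAME"), (0, 25, 35, "DATE")]
    print(remove_sentences(sents, dets))  # sentence-removal corpus: PII sentence dropped
    print(pseudonymize(sents, dets))      # pseudonymized corpus: surrogates inserted

Either output could then serve as the domain-adaptation corpus for continued pre-training of a general BERT model, which is the setting the paper evaluates on six clinical downstream tasks.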
Pages: 4245-4252
Number of pages: 8