Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

Cited by: 0
Authors
Vakili, Thomas [1 ]
Lamproudis, Anastasios [1 ]
Henriksson, Aron [1 ]
Dalianis, Hercules [1 ]
Affiliations
[1] Stockholm Univ, Dept Comp & Syst Sci DSV, Kista, Sweden
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Keywords
Privacy-preserving machine learning; pseudonymization; de-identification; Swedish clinical text; pre-trained language models; BERT; downstream tasks; NER; multi-label classification; domain adaptation
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
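The abstract contrasts two automatic de-identification strategies: discarding every sentence that contains detected personally identifiable information (PII), and pseudonymization, i.e. replacing detected PII spans with realistic surrogate values. The minimal Python sketch below illustrates both ideas; the NER output format, example sentences, and surrogate lists are invented for demonstration and do not reflect the authors' actual pipeline or data.

# Hypothetical illustration of the two de-identification strategies described
# in the abstract; not the authors' actual system.
import random

# Invented surrogate pools per PII label.
SURROGATES = {
    "FIRST_NAME": ["Karin", "Erik", "Maria"],
    "DATE": ["2015-03-12", "2016-07-01"],
}

def remove_sentences(sentences, detections):
    """Strategy (a): keep only sentences with no detected PII."""
    flagged = {sent_idx for sent_idx, _, _, _ in detections}
    return [s for i, s in enumerate(sentences) if i not in flagged]

def pseudonymize(sentences, detections):
    """Strategy (b): replace each detected PII span with a random surrogate."""
    result = list(sentences)
    # Apply spans from right to left so earlier character offsets stay valid.
    for sent_idx, start, end, label in sorted(detections, key=lambda d: (d[0], -d[1])):
        surrogate = random.choice(SURROGATES.get(label, ["[REMOVED]"]))
        s = result[sent_idx]
        result[sent_idx] = s[:start] + surrogate + s[end:]
    return result

if __name__ == "__main__":
    # Toy clinical note split into sentences, with PII detections given as
    # (sentence_index, start_offset, end_offset, label) tuples.
    sents = [
        "Patient Anna admitted on 2014-05-02.",  # contains a name and a date
        "Blood pressure stable, no fever.",      # contains no PII
    ]
    dets = [(0, 8, 12, "FIRST_NAME"), (0, 25, 35, "DATE")]
    print(remove_sentences(sents, dets))  # sentence-removal corpus: PII sentence dropped
    print(pseudonymize(sents, dets))      # pseudonymized corpus: surrogates inserted

Either output could then serve as the domain-adaptation corpus for continued pre-training of a general BERT model, which is the setting the paper evaluates on six clinical downstream tasks.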
Pages: 4245-4252
Number of pages: 8