Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Cited by: 89
Authors
Rankin, Debbie [1]
Black, Michaela [1]
Bond, Raymond [2]
Wallace, Jonathan [2]
Mulvenna, Maurice [2]
Epelde, Gorka [3, 4]
Affiliations
[1] Ulster Univ, Sch Comp Engn & Intelligent Syst, Derry Londonderry, North Ireland
[2] Ulster Univ, Sch Comp, Jordanstown, North Ireland
[3] Donostia San Sebastian, Vicomtech Fdn, Donostia San Sebastian, Spain
[4] Biodonostia Hlth Res Inst, eHlth Grp, Donostia San Sebastian, Spain
Funding
EU Horizon 2020
Keywords
synthetic data; supervised machine learning; data utility; health care; decision support; statistical disclosure control; privacy; open data; stochastic gradient descent; decision tree; k-nearest neighbors; random forest; support vector machine; MICRODATA; RISK;
DOI
10.2196/18910
Chinese Library Classification number
R-058
Abstract
Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.
Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.
Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
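The evaluation protocol described in the abstract (train on real data or on synthetic data, then test both models on the same held-out real data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy dataset, the per-class Gaussian generator (a simplified stand-in for a parametric synthesizer), the hand-written k-nearest neighbors classifier, and every function name below are assumptions made for illustration only.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy stand-in for a real dataset: two Gaussian features, binary label."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def fit_parametric(rows):
    """Per class: record the class count and each feature's mean and std dev."""
    params = {}
    for label in sorted({y for _, y in rows}):
        feats = [x for x, y in rows if y == label]
        stats = [(statistics.mean(c), statistics.stdev(c)) for c in zip(*feats)]
        params[label] = (len(feats), stats)
    return params


def sample_synthetic(params, n, rng):
    """Draw synthetic rows from the fitted per-class Gaussians (hypothetical generator)."""
    labels = sorted(params)
    weights = [params[label][0] for label in labels]
    rows = []
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        rows.append(([rng.gauss(m, s) for m, s in params[label][1]], label))
    return rows


def knn_predict(train, x, k=5):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, test, k=5):
    """Fraction of test rows classified correctly by k-NN fitted on train."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def run_experiment(seed=0):
    """Return (accuracy trained on real, accuracy trained on synthetic, deviation)."""
    rng = random.Random(seed)
    real = make_real_data(300, rng)
    real_train, real_test = real[:200], real[200:]
    synthetic_train = sample_synthetic(
        fit_parametric(real_train), len(real_train), rng
    )
    acc_real = accuracy(real_train, real_test)        # train real, test real
    acc_synth = accuracy(synthetic_train, real_test)  # train synthetic, test real
    return acc_real, acc_synth, acc_real - acc_synth


if __name__ == "__main__":
    acc_real, acc_synth, deviation = run_experiment()
    print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}  deviation: {deviation:.3f}")
```

The deviation returned here corresponds to the accuracy differences the abstract reports; on a toy problem this simple it says nothing about the magnitudes observed in the study.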
Pages: 21