Generation and evaluation of synthetic patient data

Cited by: 197
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore Natl Lab, 7000 East Ave, Livermore, CA 94550 USA
[2] Informat Management Syst, 1455 Res Blvd, Suite 315, Rockville, MD USA
Funding
National Institutes of Health (USA);
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
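Chinese Library Classification number
R19 [Health care organization and services (health services administration)];
Discipline classification code
Abstract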
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high-dimensional and often categorical. These characteristics pose multiple modeling challenges.
Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.
Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.
Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Pages: 40
Related papers
51 entries in total
[41]   Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation [J].
Sankaranarayanan, Swami ;
Balaji, Yogesh ;
Jain, Arpit ;
Lim, Ser Nam ;
Chellappa, Rama .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :3752-3761
[42]   Simulation of Synthetic Complex Data: The R Package simPop [J].
Templ, Matthias ;
Meindl, Bernhard ;
Kowarik, Alexander ;
Dupriez, Olivier .
JOURNAL OF STATISTICAL SOFTWARE, 2017, 79 (10) :1-38
[43]   Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization [J].
Tremblay, Jonathan ;
Prakash, Aayush ;
Acuna, David ;
Brophy, Mark ;
Jampani, Varun ;
Anil, Cem ;
To, Thang ;
Cameracci, Eric ;
Boochoon, Shaad ;
Birchfield, Stan .
PROCEEDINGS 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2018, :1082-1090
[44]   Protecting Privacy in Large Datasets-First We Assess the Risk; Then We Fuzzy the Data [J].
Ursin, Giske ;
Sen, Sagar ;
Mottu, Jean-Marie ;
Nygard, Mari .
CANCER EPIDEMIOLOGY BIOMARKERS & PREVENTION, 2017, 26 (08) :1219-1224
[45]   Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record [J].
Walonoski, Jason ;
Kramer, Mark ;
Nichols, Joseph ;
Quina, Andre ;
Moesel, Chris ;
Hall, Dylan ;
Duffett, Carlton ;
Dube, Kudakwashe ;
Gallagher, Thomas ;
McLachlan, Scott .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2018, 25 (03) :230-238
[46]   Dopamine crosslinked graphene oxide membrane for simultaneous removal of organic pollutants and trace heavy metals from aqueous solution [J].
Wang, Jing ;
Huang, Tiefan ;
Zhang, Lin ;
Yu, Qiming Jimmy ;
Hou, Li'an .
ENVIRONMENTAL TECHNOLOGY, 2018, 39 (23) :3055-3065
[47]  
Woo M.-J., 2009, J PRIVACY CONFIDENTI, V1, P111, DOI 10.29012/JPC.V1I1.568
[48]   Differential Privacy via Wavelet Transforms [J].
Xiao, Xiaokui ;
Wang, Guozhang ;
Gehrke, Johannes .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, 23 (08) :1200-1214
[49]  
Xie L., 2018, CoRR
[50]   PrivBayes: Private Data Release via Bayesian Networks [J].
Zhang, Jun ;
Cormode, Graham ;
Procopiuc, Cecilia M. ;
Srivastava, Divesh ;
Xiao, Xiaokui .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2017, 42 (04)