Generation and evaluation of synthetic patient data

Cited by: 197
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore Natl Lab, 7000 East Ave, Livermore, CA 94550 USA
[2] Informat Management Syst, 1455 Res Blvd, Suite 315, Rockville, MD USA
Funding
National Institutes of Health (USA);
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
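Chinese Library Classification number
R19 [Health care organization and services (health services administration)];
Subject classification code
Abstract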
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been slower and more limited than in other application domains. A major reason is the limited availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic synthetic datasets can be leveraged to accelerate methodological developments in medicine. Medical data is typically high dimensional and often categorical, and these characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, comprising over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Pages: 40