Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS

Cited by: 15
Authors
Loong, Bronwyn [1, 4]
Zaslavsky, Alan M. [2 ]
He, Yulei [2 ]
Harrington, David P. [3 ]
Affiliations
[1] Australian Natl Univ, Res Sch Finance Actuarial Studies & Appl Stat, Canberra, ACT 0200, Australia
[2] Harvard Univ, Sch Med, Dept Hlth Care Policy, Boston, MA 02115 USA
[3] Dana Farber Canc Inst, Dept Biostat & Computat Biol, Boston, MA 02215 USA
[4] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
data confidentiality; data utility; disclosure risk; multiple imputation; synthetic data; MULTIPLE-IMPUTATION; LIKELIHOOD; SELECTION; TESTS;
DOI
10.1002/sim.5841
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are best suited for preliminary data analysis purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. Copyright © 2013 John Wiley & Sons, Ltd.
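The abstract does not spell out the inferential methods it reviews. As a minimal sketch, assuming they follow the standard combining rules for partially synthetic data (Reiter, 2003), the point estimates and variance estimates from m synthetic datasets could be pooled roughly as below. The function name and the input values are hypothetical and are not taken from the paper.

```python
import math
import statistics


def combine_partially_synthetic(estimates, variances):
    """Pool results from m partially synthetic datasets.

    Assumes the combining rules usually attributed to Reiter (2003);
    the abstract itself does not state which rules the authors used.
    estimates: point estimate of the same quantity from each dataset
    variances: the corresponding within-dataset variance estimates
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = statistics.mean(estimates)      # pooled point estimate
    b = statistics.variance(estimates)      # between-dataset variance (divisor m - 1)
    u_bar = statistics.mean(variances)      # average within-dataset variance

    # Partial synthesis: total variance is u_bar + b / m.
    # (This differs from the missing-data rule, which uses (1 + 1/m) * b.)
    total_var = u_bar + b / m

    # Degrees of freedom for the reference t distribution.
    df = math.inf if b == 0 else (m - 1) * (1 + u_bar / (b / m)) ** 2
    return q_bar, total_var, df


# Hypothetical inputs: one regression coefficient estimated on three synthetic datasets.
q_bar, total_var, df = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, total_var, df)
```

Comparing the pooled estimate and its interval with the same quantity computed on the original data is one simple way to express the data utility comparison the abstract describes.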
Pages: 4139-4161
Number of pages: 23
Related papers
28 in total
  • [1] Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey
    Drechsler, Joerg
    Reiter, J. P.
    JOURNAL OF OFFICIAL STATISTICS, 2009, 25 (04) : 589 - 603
  • [2] Generating partially synthetic geocoded public use data with decreased disclosure risk by using differential smoothing
    Quick, Harrison
    Holan, Scott H.
    Wikle, Christopher K.
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 2018, 181 (03) : 649 - 661
  • [3] Combining synthetic data with subsampling to create public use microdata files for large scale surveys
    Drechsler, Joerg
    Reiter, Jerome P.
    SURVEY METHODOLOGY, 2012, 38 (01) : 73 - 79
  • [4] Deciphering gene expression patterns using large-scale transcriptomic data and its applications
    Chen, Shunjie
    Wang, Pei
    Guo, Haiping
    Zhang, Yujie
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (06)
  • [5] Large-Scale Data Analysis Using Heuristic Methods
    Dzemyda, Gintautas
    Sakalauskas, Leonidas
    INFORMATICA, 2011, 22 (01) : 1 - 10
  • [6] On using stratified two-stage sampling for large-scale multispecies surveys
    Aubry, Philippe
    Quaintenne, Gwenael
    Dupuy, Jeremy
    Francesiaz, Charlotte
    Guillemain, Matthieu
    Caizergues, Alain
    ECOLOGICAL INFORMATICS, 2023, 77
  • [7] Large-scale secure model learning and inference using synthetic data for IoT-based big data analytics
    Tekchandani, Prakash
    Das, Ashok Kumar
    Kumar, Neeraj
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [8] Using plausible values when fitting multilevel models with large-scale assessment data using R
    Huang, Francis L.
    LARGE-SCALE ASSESSMENTS IN EDUCATION, 2024, 12 (01)
  • [9] Improving on analyses of self-reported data in a large-scale health survey by using information from an examination-based survey
    Schenker, Nathaniel
    Raghunathan, Trivellore E.
    Bondarenko, Irina
    STATISTICS IN MEDICINE, 2010, 29 (05) : 533 - 545
  • [10] Normalization and integration of large-scale metabolomics data using support vector regression
    Shen, Xiaotao
    Gong, Xiaoyun
    Cai, Yuping
    Guo, Yuan
    Tu, Jia
    Li, Hao
    Zhang, Tao
    Wang, Jialin
    Xue, Fuzhong
    Zhu, Zheng-Jiang
    METABOLOMICS, 2016, 12 (05)