Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey

Cited by: 0
Authors
Drechsler, Joerg [1]
Reiter, J. P. [2]
Affiliations
[1] Inst Employment Res, D-90478 Nurnberg, Germany
[2] Duke Univ, Dept Stat Sci, Durham, NC 27708 USA
Funding
U.S. National Science Foundation;
Keywords
Confidentiality; disclosures; multiple imputation; synthetic data; MULTIPLE-IMPUTATION; IDENTIFICATION DISCLOSURE; MICRODATA;
DOI
Not available
Chinese Library Classification (CLC) number
O1 [Mathematics]; C [Social Sciences, General];
Discipline classification code
03 ; 0303 ; 0701 ; 070101 ;
Abstract
Statistical agencies that disseminate data to the public must protect the confidentiality of respondents' identities and sensitive attributes. To satisfy these requirements, agencies can release the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. These are called partially synthetic data. In this article, we empirically examine trade-offs between inferential accuracy and confidentiality risks for partially synthetic data, with emphasis on the role of the number of released datasets. We also present a two-stage imputation scheme that allows agencies to release different numbers of imputations for different variables. This scheme can result in lower disclosure risks and higher data utility than the typical one-stage imputation with the same number of released datasets. The empirical analyses are based on partial synthesis of the German IAB Establishment Survey.
Pages: 589-603
Page count: 15