Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey

Cited by: 0

Authors:

Drechsler, Joerg ^{[1]}

Reiter, J. P. ^{[2]}

Affiliations:

[1] Inst Employment Res, D-90478 Nurnberg, Germany

[2] Duke Univ, Dept Stat Sci, Durham, NC 27708 USA

Source:

JOURNAL OF OFFICIAL STATISTICS | 2009 / Vol. 25 / No. 04

Funding:

U.S. National Science Foundation;

Keywords:

Confidentiality; disclosures; multiple imputation; synthetic data; MULTIPLE-IMPUTATION; IDENTIFICATION DISCLOSURE; MICRODATA;

DOI:

Not available

Chinese Library Classification:

O1 [Mathematics]; C [Social Sciences, General];

Discipline classification codes:

03 ; 0303 ; 0701 ; 070101 ;

Abstract:

Statistical agencies that disseminate data to the public must protect the confidentiality of respondents' identities and sensitive attributes. To satisfy these requirements, agencies can release the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. These are called partially synthetic data. In this article, we empirically examine trade-offs between inferential accuracy and confidentiality risks for partially synthetic data, with emphasis on the role of the number of released datasets. We also present a two-stage imputation scheme that allows agencies to release different numbers of imputations for different variables. This scheme can result in lower disclosure risks and higher data utility than the typical one-stage imputation with the same number of released datasets. The empirical analyses are based on partial synthesis of the German IAB Establishment Survey.


Pages: 589-603

Page count: 15

Total: 45 items

[21] Multiple imputation in practice-a case study using a complex German establishment survey
Drechsler, Joerg
ASTA-ADVANCES IN STATISTICAL ANALYSIS, 2011, 95 (01) : 1 - 26
[22] Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
El Emam, Khaled
Mosquera, Lucy
Bass, Jason
JOURNAL OF MEDICAL INTERNET RESEARCH, 2020, 22 (11)
[23] Identification Risks Evaluation of Partially Synthetic Data with the Identification Risk Calculation R Package
Hornby, Ryan
Hu, Jingchen
TRANSACTIONS ON DATA PRIVACY, 2021, 14 (01) : 37 - 52
[24] A Multivariate Stopping Rule for Survey Data Collection: Empirical Evaluation from a Panel Study
Zhang, Xinyu
Wagner, James
Elliott, Michael R.
West, Brady T.
Coffey, Stephanie
JOURNAL OF OFFICIAL STATISTICS, 2025, 41 (01) : 468 - 494
[25] Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study
El Emam, Khaled
Mosquera, Lucy
Fang, Xi
El-Hussuna, Alaa
JMIR MEDICAL INFORMATICS, 2022, 10 (04) : 185 - 195
[26] Synthetic healthcare data utility with biometric pattern recognition using adversarial networks
Khadidos, Adil O.
Manoharan, Hariprasath
Khadidos, Alaa O.
Selvarajan, Shitharth
Singh, Subhav
SCIENTIFIC REPORTS, 2025, 15 (01):
[27] Utility of GAN generated synthetic data for cardiovascular diseases mortality prediction: an experimental study
Khan, Shahzad Ahmed
Murtaza, Hajra
Ahmed, Musharif
HEALTH AND TECHNOLOGY, 2024, 14 (03) : 557 - 580
[28] Empirical Analysis of Attribute-Aware Recommender System Algorithms Using Synthetic Data
Tso, Karen H. L.
Schmidt-Thieme, Lars
JOURNAL OF COMPUTERS, 2006, 1 (04) : 18 - 29
[29] A Study of Using Synthetic Data for Effective Association Knowledge Learning
Liu, Yuchi
Wang, Zhongdao
Zhou, Xiangxin
Zheng, Liang
MACHINE INTELLIGENCE RESEARCH, 2023, 20 (02) : 194 - 206
[30] A Survey on Privacy Preserving Synthetic Data Generation and a Discussion on a Privacy-Utility Trade-off Problem
Ghatak, Debolina
Sakurai, Kouichi
SCIENCE OF CYBER SECURITY, SCISEC 2022 WORKSHOPS, 2022, 1680 : 167 - 180
