CTAB-GAN+: enhancing tabular data synthesis

Cited by: 13
Authors
Zhao, Zilong [1, 2]
Kunar, Aditya [1]
Birke, Robert [3]
van der Scheer, Hiek [4]
Chen, Lydia Y. [1, 5]
Affiliations
[1] Delft Univ Technol, Fac Elect Engn Math & Comp Sci, Delft, Netherlands
[2] Tech Univ Munich, Sch Social Sci & Technol, Munich, Germany
[3] Univ Turin, Comp Sci Dept, Turin, Italy
[4] AEGON, The Hague, Netherlands
[5] Univ Neuchatel, Comp Sci Dept, Neuchatel, Switzerland
Source
FRONTIERS IN BIG DATA | 2024 / Volume 6
Keywords
GAN; data synthesis; tabular data; differential privacy; imbalanced distribution
DOI
10.3389/fdata.2023.1296508
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The use of synthetic data is gaining momentum, in part because original data are often unavailable for privacy and legal reasons and in part because synthetic data are useful for augmenting authentic data. Generative adversarial networks (GANs), a paragon of generative models, initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, raising the risk of privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-score) across multiple datasets and learning tasks under a given privacy budget.
Pages: 17