Deep Generative Models for Synthetic Data: A Survey

Cited by: 22

Authors:

Eigenschink, Peter [1]

Reutterer, Thomas [1]

Vamosi, Stefan [1]

Vamosi, Ralf [1,2]

Sun, Chang [3]

Kalcher, Klaudius [4]

Affiliations:

[1] Vienna Univ Econ & Business, Dept Mkt, A-1020 Vienna, Austria

[2] Vienna Univ Technol, High Performance Comp, A-1040 Vienna, Austria

[3] Maastricht Univ, Inst Data Sci, NL-6200 MD Maastricht, Netherlands

[4] Mostly AI GmbH, A-1030 Vienna, Austria

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Data models; Synthetic data; Measurement; Biological system modeling; Analytical models; Training data; Medical services; Artificial intelligence; big data; deep learning; generative models; neural networks; synthetic data; privacy; NATURAL-LANGUAGE GENERATION; PREDICTION;

DOI:

10.1109/ACCESS.2023.3275134

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. As a result, models for synthesizing text data for natural language processing can no longer be readily compared to models for synthesizing health records. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data, based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data-set relative to the original data-set, regardless of the models' internal structures. The criteria reflect requirements that different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data in natural language, speech and audio processing tasks. At the same time, novelty and representativeness are more important for healthcare and mobility data. We also find that representativeness is often measured using statistical metrics, realism using human judgement, and novelty using privacy tests.

Citation

Pages: 47304-47320

Page count: 17

