Survey on Synthetic Data Generation, Evaluation Methods and GANs

Cited by: 160
Authors
Figueira, Alvaro [1]
Vaz, Bruno [2]
Affiliations
[1] Univ Porto, CRACS INESC TEC, P-4169007 Porto, Portugal
[2] Univ Porto, Fac Sci, Rua Campo Alegre S-N, P-4169007 Porto, Portugal
Keywords
synthetic data generation; generative adversarial networks; evaluation of synthetic data; SMOTE
DOI
10.3390/math10152733
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
Synthetic data are artificially generated data. When real data are scarce or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples following the underlying distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, to the best of our knowledge, none in the relevant literature has explicitly combined these two topics. This survey aims to fill this gap by combining synthetic data generation and GANs, giving new researchers in the field a solid starting point with a general overview of the key contributions and useful references. We conducted a review of the state of the art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. We thoroughly reviewed GANs, including their most common training problems and their most important breakthroughs, with a focus on GAN architectures for tabular data. We also present the main algorithms for generating synthetic data, their applications, and our thoughts on these methods. Finally, we review the main techniques for evaluating the quality of synthetic data (especially tabular data) and provide a schematic overview of the information presented in this paper.
Pages: 41