Survey on Synthetic Data Generation, Evaluation Methods and GANs

Cited by: 160
Authors
Figueira, Alvaro [1]
Vaz, Bruno [2]
Affiliations
[1] Univ Porto, CRACS INESC TEC, P-4169007 Porto, Portugal
[2] Univ Porto, Fac Sci, Rua Campo Alegre S-N, P-4169007 Porto, Portugal
Keywords
synthetic data generation; generative adversarial networks; evaluation of synthetic data; SMOTE
DOI
10.3390/math10152733
Chinese Library Classification number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Synthetic data consists of artificially generated data. When data are scarce or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples that follow the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, to the best of our knowledge, none in the relevant literature has explicitly combined these two topics. This survey aims to fill this gap by combining synthetic data generation and GANs, giving new researchers in the field a solid starting point with a general overview of the key contributions and useful references. We conducted a review of the state of the art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. We thoroughly reviewed GANs, their most common training problems, and their most important breakthroughs, with a focus on GAN architectures for tabular data. We also presented the main algorithms for generating synthetic data, their applications, and our thoughts on these methods. Finally, we reviewed the main techniques for evaluating the quality of synthetic data (especially tabular data) and provided a schematic overview of the information presented in this paper.
Pages: 41