Challenges and opportunities of generative models on tabular data

Cited by: 17
Authors
Wang, Alex X. [1]
Chukova, Stefanka S. [1]
Simpson, Colin R. [2, 3]
Nguyen, Binh P. [1]
Affiliations
[1] Victoria Univ Wellington, Sch Math & Stat, Wellington 6012, New Zealand
[2] Victoria Univ Wellington, Wellington Fac Hlth, Wellington 6012, New Zealand
[3] Univ Edinburgh, Usher Inst, Edinburgh, Scotland
Keywords
Tabular data synthesis; Deep generative models; SMOTE; Benchmark; Heterogeneous data; Synthetic data generation
DOI
10.1016/j.asoc.2024.112223
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Tabular data, organized like tables with rows and columns, is widely used. Existing models for tabular data synthesis often face limitations related to data size or complexity. In contrast, deep generative models, a part of deep learning, demonstrate proficiency in handling large and complex data sets. While these models have shown remarkable success in generating image and audio data, their application in tabular data synthesis is relatively new, lacking a comprehensive comparison with existing methods. To fill this gap, this study aims to systematically evaluate and compare the performance of deep generative models with these existing methods for tabular data synthesis, while also investigating the efficacy of post-processing techniques. We aim to identify strengths and limitations and provide insights for future research and practical applications. Our study showed that the Synthetic Minority Oversampling Technique (SMOTE) and its variants outperform deep generative models, especially for small datasets. However, we observed that an ensemble of deep generative models and post-generation processing performs better on large datasets than SMOTE alone. The results of our study indicate that deep generative models hold promise as a valuable tool for generating tabular data. Nonetheless, further research is warranted to enhance the performance of deep generative models and gain a comprehensive understanding of their limitations.
Pages: 15