Blending is all you need: Data-centric ensemble synthetic data

Cited: 0
Authors
Wang, Alex X. [1]
Simpson, Colin R. [2,3]
Nguyen, Binh P. [1]
Affiliations
[1] Victoria Univ Wellington, Sch Math & Stat, Wellington 6012, New Zealand
[2] Victoria Univ Wellington, Wellington Fac Hlth, Wellington 6012, New Zealand
[3] Univ Edinburgh, Usher Inst, Edinburgh, Midlothian, Scotland
Keywords
Generative model; Data-centric AI; Classification; Ensemble; Centroid displacement; Tabular data
DOI
10.1016/j.ins.2024.121610
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Deep generative models have gained increasing popularity, particularly in fields such as natural language processing and computer vision. Recently, efforts have been made to extend these advanced algorithms to tabular data. While generative models have shown promising results in creating synthetic data, their high computational demands and the need for careful parameter tuning present significant challenges. This study investigates whether a collective integration of refined synthetic datasets from multiple models can achieve comparable or superior performance to that of a single, large generative model. To this end, we developed a Data-Centric Ensemble Synthetic Data model that leverages principles of ensemble learning. Our approach applies a data refinement process to synthetic datasets from several generators, systematically eliminating noise and then ranking, selecting, and combining the datasets to create an augmented, high-quality synthetic dataset. This approach improves both the quantity and the quality of the data. Central to this process, we introduce the Ensemble k-Nearest Neighbors with Centroid Displacement (EKCD) algorithm for noise filtering, alongside a density score for ranking and selecting data. Our experiments confirm the effectiveness of EKCD in removing noisy synthetic samples. Additionally, the ensemble model based on the refined synthetic data substantially enhances the performance of machine learning models, sometimes even outperforming models trained on real data.
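The record does not include the EKCD algorithm's details, but the general idea it names — filtering synthetic samples with an ensemble of k-nearest-neighbor checks based on centroid displacement — can be sketched. In the hedged sketch below, a synthetic sample is assigned to the class whose local centroid (computed from the sample's k nearest real neighbors) moves least when the sample joins it, and the sample is kept only if a majority of such classifiers, run over several values of k, agree with its synthetic label. All function names, the voting rule, and the choice of k values are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

def centroid_displacement_predict(X_train, y_train, x, k=5):
    """Predict the class of x by minimum centroid displacement among
    its k nearest neighbors (illustrative, not the paper's exact rule)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]                 # indices of the k nearest neighbors
    Xk, yk = X_train[idx], y_train[idx]
    best_class, best_disp = None, np.inf
    for c in np.unique(yk):
        Xc = Xk[yk == c]
        centroid = Xc.mean(axis=0)
        # how far the class centroid shifts if x is added to class c
        new_centroid = np.vstack([Xc, x]).mean(axis=0)
        disp = np.linalg.norm(new_centroid - centroid)
        if disp < best_disp:
            best_class, best_disp = c, disp
    return best_class

def ekcd_filter(X_real, y_real, X_syn, y_syn, ks=(3, 5, 7)):
    """Keep a synthetic sample only if a majority of centroid-displacement
    k-NN classifiers (one per k) agree with its synthetic label."""
    keep = []
    for x, y in zip(X_syn, y_syn):
        votes = sum(
            centroid_displacement_predict(X_real, y_real, x, k) == y
            for k in ks
        )
        keep.append(votes > len(ks) / 2)
    keep = np.array(keep)
    return X_syn[keep], y_syn[keep]
```

On two well-separated clusters, a synthetic point placed inside the class-0 cluster but labeled 1 is rejected by all the k-NN voters, while the same point labeled 0 is kept — the behavior the abstract attributes to EKCD's noise filtering.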
Pages: 15