Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning

Cited by: 162
Authors
Engelmann, Justin [1]
Lessmann, Stefan [1]
Affiliation
[1] Humboldt Univ, Sch Business & Econ, Unter Linden 6, D-10099 Berlin, Germany
Keywords
Imbalanced learning; Generative adversarial networks; Credit scoring; Oversampling; ART CLASSIFICATION ALGORITHMS; CREDIT;
DOI
10.1016/j.eswa.2021.114582
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Class imbalance impedes the predictive performance of classification models. Popular countermeasures include oversampling minority class cases by creating synthetic examples. The paper examines the potential of Generative Adversarial Networks (GANs) for oversampling. A few prior studies have used GANs for this purpose but do not reflect recent methodological advancements for generating tabular data using GANs. The paper proposes an approach based on a conditional Wasserstein GAN that can effectively model tabular datasets with numerical and categorical variables and pays special attention to the down-stream classification task through an auxiliary classifier loss. We focus on a credit scoring context in which binary classifiers predict the default risk of loan applications. Empirical comparisons in this context evidence the competitiveness of GAN-based oversampling compared to several standard oversampling regimes. We also clarify the conditions under which oversampling in general and the proposed GAN-based approach in particular raise predictive performance. In sum, our findings suggest that GAN architectures for tabular data and our extensions deserve a place in data scientists' modelling toolbox.
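The abstract names two ingredients: a Wasserstein objective and an auxiliary classifier loss that steers synthetic rows toward the class they were conditioned on. The sketch below is only an illustration of how such terms might be combined for the generator; it is not the authors' implementation. The function names, the `aux_weight` parameter, and the use of binary cross-entropy are assumptions made for this example.

```python
import numpy as np

def generator_loss(critic_scores_fake, aux_probs_fake, target_labels, aux_weight=1.0):
    """Hypothetical generator objective: Wasserstein term plus auxiliary classifier term."""
    # Wasserstein term: the generator tries to raise the critic's score on synthetic rows.
    wasserstein_term = -np.mean(critic_scores_fake)
    # Auxiliary term (assumed to be binary cross-entropy): penalises synthetic rows
    # whose predicted class disagrees with the class the generator was conditioned on.
    eps = 1e-12
    p = np.clip(aux_probs_fake, eps, 1.0 - eps)
    bce = -np.mean(target_labels * np.log(p) + (1 - target_labels) * np.log(1 - p))
    return wasserstein_term + aux_weight * bce

def n_synthetic_needed(y):
    """Number of synthetic minority rows needed to balance a binary label vector."""
    counts = np.bincount(y, minlength=2)
    return int(counts.max() - counts.min())

# Example: 90 non-defaults and 10 defaults.
labels = np.array([0] * 90 + [1] * 10)
print(n_synthetic_needed(labels))
```

In this sketch, setting `aux_weight` to zero reduces the objective to the plain Wasserstein generator term, which makes the contribution of the auxiliary term easy to isolate.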
Pages: 13