Tree boosting methods for balanced and imbalanced classification and their robustness over time in risk assessment

Cited by: 2
Authors
Velarde, Gissel [1 ]
Weichert, Michael [1 ]
Deshmunkh, Anuj [1 ]
Deshmane, Sanjay [1 ]
Sudhir, Anindya [1 ]
Sharma, Khushboo [1 ]
Joshi, Vaibhav [1 ]
Affiliations
[1] Vodafone GmbH, Ferdinand Pl 1, D-40549 Dusseldorf, Germany
Source
INTELLIGENT SYSTEMS WITH APPLICATIONS | 2024, Vol. 22
Keywords
Balanced & imbalanced classification; XGBoost; Machine learning; AI; Risk assessment; Performance evaluation;
DOI
10.1016/j.iswa.2024.200354
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Most real-world classification problems involve imbalanced datasets, which pose a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, often the class of greatest interest, is difficult to detect. This paper empirically evaluates the performance of tree boosting methods across different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost stand out in several benchmarks for their detection performance and speed; therefore, XGBoost and Imbalance-XGBoost are evaluated. After motivating the use of machine learning for risk assessment, the paper reviews evaluation metrics for detection systems, i.e., binary classifiers. It proposes a method for data preparation followed by tree boosting, including hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (K), 10K, and 100K samples with class distributions of 50, 45, 25, and 5 percent positive samples. As expected, recognition performance increases as more data is given for training, and the F1 score decreases as the data distribution becomes more imbalanced, yet it remains significantly above the precision-recall baseline determined by the ratio of positives to positives plus negatives. Sampling to balance the training set does not provide consistent improvement and can deteriorate detection. In contrast, classifier hyper-parameter optimization improves recognition but should be applied carefully depending on data volume and distribution. Finally, the method is robust to data variation over time up to a point; retraining can be applied when performance starts to deteriorate.
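The precision-recall baseline mentioned in the abstract can be made concrete with a short sketch (not taken from the paper). It computes F1 from confusion-matrix counts and compares it against the baseline precision P / (P + N), i.e., the precision of a trivial classifier that labels every sample positive; the counts used below are hypothetical and purely illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def baseline_precision(positives: int, negatives: int) -> float:
    """Precision of a trivial classifier that predicts positive for everything."""
    return positives / (positives + negatives)

# Hypothetical counts for a 5%-positive dataset of 10K samples (illustrative only).
positives, negatives = 500, 9500
tp, fp, fn = 350, 200, 150  # assumed confusion-matrix entries for some classifier
print(round(baseline_precision(positives, negatives), 3))  # 0.05
print(round(f1_score(tp, fp, fn), 3))                      # 0.667
```

A detector is only useful on such a distribution if its score clearly exceeds the 0.05 baseline, which is why the paper reports F1 relative to the positive-class ratio rather than accuracy.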
Pages: 9
References
(19 total)
[1] Bergstra, J. Journal of Machine Learning Research, 2012, 13: 281.
[2] Chen, Tianqi; Guestrin, Carlos. XGBoost: A Scalable Tree Boosting System. KDD'16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016: 785-794.
[3] Chicco, Davide; Totsch, Niklas; Jurman, Giuseppe. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining, 2021, 14(1): 1-22.
[4] Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001, 29(5): 1189-1232.
[5] Hajek, Petr; Abedin, Mohammad Zoynul; Sivarajah, Uthayasankar. Fraud Detection in Mobile Payment Systems using an XGBoost-based Framework. Information Systems Frontiers, 2023, 25(5): 1985-2003.
[6] Howell, J. Telecom fraud on the rise: 2021 CFCA global telecommunications fraud loss survey. 2021.
[7] Kim, Misuk; Hwang, Kyu-Baek. An empirical evaluation of sampling methods for the classification of imbalanced data. PLOS ONE, 2022, 17(7).
[8] Lemaître, G. Journal of Machine Learning Research, 2017, 18.
[9] Li, Yanting; Jin, Junwei; Ma, Jiangtao; Zhu, Fubao; Jin, Baohua; Liang, Jing; Chen, C. L. Philip. Imbalanced least squares regression with adaptive weight learning. Information Sciences, 2023, 648.
[10] McDonald, C. Leveraging machine learning to detect fraud: Tips to developing a winning Kaggle solution. 2021.