Imbalance factor: a simple new scale for measuring inter-class imbalance extent in classification problems

Cited by: 4

Authors:

Pirizadeh, Mohsen ^{[1]}

Farahani, Hadi ^{[1]}

Kheradpisheh, Saeed Reza ^{[1]}

Affiliations:

[1] Shahid Beheshti Univ, Fac Math Sci, Dept Comp & Data Sci, Tehran, Iran

Source:

KNOWLEDGE AND INFORMATION SYSTEMS | 2023 / Vol. 65 / Issue 10

Keywords:

Class imbalance; Skewed class distribution; Imbalance extent; Information theory; ALGORITHM; RATIO;

DOI:

10.1007/s10115-023-01881-y

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning from datasets in which the absolute frequencies of the classes differ is one of the most challenging tasks in machine learning. Efforts have been made to tackle the problem of class imbalance by providing solutions at the data and algorithmic levels. To categorize these solutions according to the level of class imbalance in a problem, and to obtain meaningful and consistent interpretations from experiments, it is essential to be able to quantify the extent of dataset imbalance. A competent scale for summarizing the severity of inter-class imbalance must meet at least the following three conditions: (1) it can calculate the imbalance extent for both binary and multi-class datasets, (2) its output lies within a definite, fixed range of values, and (3) it is correlated with the performance of different classifiers. Nevertheless, none of the scales introduced so far satisfies all of these requirements. In this study, we propose an informative scale called the imbalance factor (IF), based on information theory, which quantifies the extent of dataset imbalance as a single value in the range [0, 1], independent of the number of classes. In addition, IF offers various limiting cases with different growth rates according to its alpha order. This property is critical because it addresses the possibility of distinct distributions having the same extent. Finally, empirical experiments indicate that, with an average correlation of 0.766 with classification accuracies over 15 real datasets, IF is remarkably more sensitive to class imbalance changes than previous scales.
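The abstract does not state the formula for IF, so the following is only a minimal sketch of how an information-theoretic imbalance scale with an alpha order could be built. It assumes a normalized Rényi entropy of the class proportions, with 0 meaning perfectly balanced and 1 meaning all samples fall in one class. The function names `renyi_entropy` and `imbalance_score`, the choice of Rényi entropy, and the direction of the scale are all assumptions made for illustration, not the authors' definition.

```python
import math


def renyi_entropy(probs, alpha=1.0):
    """Rényi entropy (natural log) of a probability vector.

    alpha = 1 is handled as the Shannon limit, alpha = inf as min-entropy.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    positive = [p for p in probs if p > 0]  # zero-probability classes add nothing
    if math.isinf(alpha):
        return -math.log(max(positive))
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in positive)
    return math.log(sum(p ** alpha for p in positive)) / (1.0 - alpha)


def imbalance_score(class_counts, alpha=1.0):
    """Hypothetical imbalance score in [0, 1] (NOT the paper's IF formula).

    Computed as 1 - H_alpha(p) / log(k), where k is the number of classes,
    so a uniform class distribution scores 0 and a single occupied class scores 1.
    """
    k = len(class_counts)
    total = sum(class_counts)
    if k < 2 or total <= 0:
        raise ValueError("need at least two classes and one sample")
    probs = [c / total for c in class_counts]
    score = 1.0 - renyi_entropy(probs, alpha) / math.log(k)
    return min(1.0, max(0.0, score))  # guard against floating-point drift


if __name__ == "__main__":
    for counts in ([50, 50], [90, 10], [50, 30, 20], [98, 1, 1]):
        print(counts, [round(imbalance_score(counts, a), 3) for a in (0.5, 1.0, 2.0)])
```

Because the entropy is divided by log(k), the score stays in [0, 1] for binary and multi-class data alike, which matches conditions (1) and (2) in the abstract. Varying `alpha` changes how quickly the score grows as the distribution skews, which is one way distinct distributions that tie at one order can be separated at another.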


Pages: 4157-4183

Page count: 27
