Imbalance factor: a simple new scale for measuring inter-class imbalance extent in classification problems

Cited by: 4
Authors
Pirizadeh, Mohsen [1]
Farahani, Hadi [1]
Kheradpisheh, Saeed Reza [1]
Affiliations
[1] Shahid Beheshti Univ, Fac Math Sci, Dept Comp & Data Sci, Tehran, Iran
Keywords
Class imbalance; Skewed class distribution; Imbalance extent; Information theory; ALGORITHM; RATIO
DOI
10.1007/s10115-023-01881-y
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Learning from datasets whose classes differ in absolute frequency is one of the most challenging tasks in machine learning. Efforts have been made to tackle class imbalance with solutions at the data and algorithm levels. To categorize these solutions by the level of class imbalance in the problem, and to obtain meaningful and consistent interpretations of experiments, it is essential to be able to quantify the extent of dataset imbalance. A competent scale for summarizing the severity of inter-class imbalance must meet at least three conditions: (1) it can be computed for both binary and multi-class datasets, (2) its output lies within a definite, fixed range of values, and (3) it correlates with the performance of different classifiers. However, none of the scales introduced so far satisfies all of these requirements. In this study, we propose an informative scale called the imbalance factor (IF), based on information theory, which quantifies the extent of dataset imbalance as a single value in the range [0, 1], independent of the number of classes. In addition, IF offers various limiting cases with different growth rates according to its alpha order. This property is critical because it can address cases in which distinct distributions would otherwise receive the same imbalance extent. Finally, empirical experiments indicate that, with an average correlation of 0.766 with classification accuracy over 15 real datasets, IF is remarkably more sensitive to changes in class imbalance than previous scales.
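The abstract does not give the formula for IF, so the following is only an illustrative sketch of the general idea it describes, not the authors' definition. It assumes a generic entropy-based score, one minus the normalized Rényi entropy of the class distribution, which happens to satisfy conditions (1) and (2): it works for any number of classes and stays in [0, 1]. The function names `renyi_entropy` and `imbalance_score` and the `alpha` parameter are hypothetical choices made for this example.

```python
import math


def renyi_entropy(probs, alpha=1.0):
    """Renyi entropy (natural log) of a probability vector.

    alpha == 1 is treated as the Shannon limit. Zero-probability
    classes contribute nothing and are skipped.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    nonzero = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in nonzero)
    return math.log(sum(p ** alpha for p in nonzero)) / (1.0 - alpha)


def imbalance_score(class_counts, alpha=1.0):
    """Hypothetical imbalance score in [0, 1] (NOT the paper's IF).

    Assumption: score = 1 - H_alpha(p) / log(k), where k is the number
    of classes. 0 means perfectly balanced; 1 means every sample
    belongs to a single class.
    """
    k = len(class_counts)
    total = sum(class_counts)
    if k < 2 or total <= 0:
        raise ValueError("need at least two classes and one sample")
    probs = [c / total for c in class_counts]
    return 1.0 - renyi_entropy(probs, alpha) / math.log(k)


if __name__ == "__main__":
    # Binary and multi-class inputs share the same [0, 1] scale.
    for counts in ([50, 50], [90, 10], [70, 20, 10], [100, 0]):
        print(counts, round(imbalance_score(counts), 3))
```

In this sketch, raising `alpha` weights the majority class more heavily, so the score of a skewed distribution grows as `alpha` increases. This is one way an order parameter can change the growth rate of such a measure; the paper's own limiting cases may behave differently.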
Pages: 4157-4183
Page count: 27