On the Dynamics of Classification Measures for Imbalanced and Streaming Data

Cited by: 31
Authors
Brzezinski, Dariusz [1, 2]
Stefanowski, Jerzy [1, 2]
Susmaga, Robert [1, 2]
Szczech, Izabela [1, 2]
Affiliations
[1] Poznan Univ Tech, CAMIL, PL-60965 Poznan, Poland
[2] Poznan Univ Tech, Inst Comp Sci, PL-60965 Poznan, Poland
Keywords
Data visualization; Atmospheric measurements; Particle measurements; Histograms; Task analysis; Size measurement; Sensitivity; Class imbalance; classification measures; concept drift; data streams; measure gradients; measure histograms;
DOI
10.1109/TNNLS.2019.2899061
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As each imbalanced classification problem comes with its own set of challenges, the measure used to evaluate classifiers must be individually selected. To help researchers make this decision in an informed manner, experimental and theoretical investigations compare general properties of measures. However, existing studies do not analyze changes in measure behavior imposed by different imbalance ratios. Moreover, several characteristics of imbalanced data streams, such as the effect of dynamically changing class proportions, have not been thoroughly investigated from the perspective of different metrics. In this paper, we study measure dynamics by analyzing changes of measure values, distributions, and gradients with diverging class proportions. For this purpose, we visualize measure probability mass functions and gradients. In addition, we put forward a histogram-based normalization method that provides a unified, probabilistic interpretation of any measure over data sets with different class distributions. The results of analyzing eight popular classification measures show that the effect class proportions have on each measure is different and should be taken into account when evaluating classifiers. Apart from highlighting imbalance-related properties of each measure, our study shows a direct connection between class ratio changes and certain types of concept drift, which could be influential in designing new types of classifiers and drift detectors for imbalanced data streams.
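The histogram-based normalization mentioned in the abstract can be illustrated with a small sketch. This is only one possible reading of the idea, not the authors' procedure: it assumes that, for fixed class counts, every confusion matrix is weighted equally, and it scores an observed measure value by the share of those matrices whose value is no higher. The function names `measure_values` and `percentile_score` are hypothetical and do not come from the paper.

```python
from itertools import product


def accuracy(tp, fn, fp, tn):
    """Fraction of correctly classified examples."""
    return (tp + tn) / (tp + fn + fp + tn)


def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return (sens * spec) ** 0.5


def measure_values(measure, n_pos, n_neg):
    """Values of `measure` over every confusion matrix with fixed class counts.

    Assumption: each (tp, fp) combination is treated as equally likely.
    """
    return sorted(
        measure(tp, n_pos - tp, fp, n_neg - fp)
        for tp, fp in product(range(n_pos + 1), range(n_neg + 1))
    )


def percentile_score(measure, tp, fn, fp, tn):
    """Share of enumerated confusion matrices whose value is <= the observed one."""
    values = measure_values(measure, tp + fn, fp + tn)
    observed = measure(tp, fn, fp, tn)
    return sum(v <= observed for v in values) / len(values)


if __name__ == "__main__":
    # A majority-class predictor on 1 positive and 9 negative examples.
    matrix = dict(tp=0, fn=1, fp=0, tn=9)
    for measure in (accuracy, g_mean):
        raw = measure(**matrix)
        normalized = percentile_score(measure, **matrix)
        print(f"{measure.__name__}: raw={raw:.2f}, percentile={normalized:.2f}")
```

Under these assumptions, the same percentile scale applies to any measure and any class ratio, which is the kind of unified, probabilistic interpretation the abstract describes. The paper itself should be consulted for the exact weighting and binning the authors use.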
Pages: 2868-2878
Number of pages: 11
Related papers
38 in total
  • [1] Alaiz-Rodríguez R, 2008, LECT NOTES ARTIF INT, V5212, P660, DOI 10.1007/978-3-540-87481-2_43
  • [2] [Anonymous], 2006, P 23 INT C MACH LEAR, DOI 10.1145/1143844.1143874
  • [3] [Anonymous], 2014, Evaluating Learning Algorithms A Classification Perspective, DOI 10.1017/CBO9780511921803
  • [4] Assessing the accuracy of prediction algorithms for classification: an overview
    Baldi, P
    Brunak, S
    Chauvin, Y
    Andersen, CAF
    Nielsen, H
    [J]. BIOINFORMATICS, 2000, 16 (05) : 412 - 424
  • [5] Bekkar M., 2013, J. Inf. Eng. Apl, V3, P27, DOI 10.5121/IJDKP.2013.3402
  • [6] Bifet A, 2010, J MACH LEARN RES, V11, P1601
  • [7] A Survey of Predictive Modeling on Imbalanced Domains
    Branco, Paula
    Torgo, Luis
    Ribeiro, Rita P.
    [J]. ACM COMPUTING SURVEYS, 2016, 49 (02)
  • [8] Brzezinski Dariusz, 2015, New Frontiers in Mining Complex Patterns. Third International Workshop, NFMCP 2014, held in conjunction with ECML-PKDD 2014. Revised Selected Papers: LNCS 8983, P87, DOI 10.1007/978-3-319-17876-9_6
  • [9] Visual-based analysis of classification measures and their properties for class imbalanced problems
    Brzezinski, Dariusz
    Stefanowski, Jerzy
    Susmaga, Robert
    Szczech, Izabela
    [J]. INFORMATION SCIENCES, 2018, 462 : 242 - 261
  • [10] Prequential AUC: properties of the area under the ROC curve for data streams with concept drift
    Brzezinski, Dariusz
    Stefanowski, Jerzy
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2017, 52 (02) : 531 - 562