The choice of scaling technique matters for classification performance

Cited by: 114
Authors
de Amorim, Lucas B. V. [1,2]
Cavalcanti, George D. C. [1]
Cruz, Rafael M. O. [3]
Affiliations
[1] Univ Fed Pernambuco, Ctr Informat, Recife, PE, Brazil
[2] Univ Fed Alagoas, Inst Comp, Maceio, Alagoas, Brazil
[3] Univ Quebec, Ecole Technol Super, Ste Foy, PQ, Canada
Keywords
Classification; Normalization; Standardization; Scaling; Preprocessing; Ensemble of classifiers; Multiple Classifier System; DYNAMIC SELECTION; COMBINATION;
DOI
10.1016/j.asoc.2022.109924
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It adjusts attribute scales so that all attributes vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is generally not made carefully. In this paper, we conduct a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, spanning monolithic and ensemble models, applied to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model across different scaling techniques tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance, and provide insights into its applicability in different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository. (c) 2022 Elsevier B.V. All rights reserved.
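The abstract's central claim, that the choice of (or absence of) a scaling technique can change a classifier's decisions, is easy to see with a distance-based model. Below is a minimal pure-Python sketch, not taken from the paper: a 1-nearest-neighbor classifier on two features with very different ranges, with a hand-rolled min-max scaler standing in for one of the techniques the paper compares. The data, feature ranges, and function names are all illustrative assumptions.

```python
import math

# Toy training data: feature 0 spans roughly [0, 1000], feature 1 spans [0, 1].
train = [((100.0, 0.9), "A"), ((900.0, 0.1), "B")]
test = (120.0, 0.1)  # near class A on feature 0, near class B on feature 1

def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nn_label(x, data):
    """1-nearest-neighbor prediction: label of the closest training point."""
    return min(data, key=lambda item: euclid(x, item[0]))[1]

# Without scaling, feature 0 dominates the distance and feature 1 is
# effectively ignored, so the test point lands with class "A".
unscaled_pred = nn_label(test, train)

def minmax_fit(points):
    """Fit a min-max scaler on training points; returns a transform
    mapping each feature to [0, 1] using the training min/max."""
    dims = range(len(points[0]))
    lo = [min(p[i] for p in points) for i in dims]
    hi = [max(p[i] for p in points) for i in dims]
    def transform(p):
        return tuple((v - l) / (h - l) if h > l else 0.0
                     for v, l, h in zip(p, lo, hi))
    return transform

# After min-max scaling, both features contribute comparably to the
# distance, and the prediction flips to class "B".
scale = minmax_fit([p for p, _ in train])
scaled_train = [(scale(p), y) for p, y in train]
scaled_pred = nn_label(scale(test), scaled_train)

print(unscaled_pred, scaled_pred)  # "A" vs. "B": scaling changed the decision
```

The same mechanism explains why the paper finds scale-sensitive models (distance- and gradient-based ones) to be the most affected, while tree-based models, which split on one feature at a time, are largely invariant to monotone rescaling.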
Pages: 19