The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms' Performance

Cited by: 39
Authors
Alshdaifat, Esra'a [1 ]
Alshdaifat, Doa'a [1 ]
Alsarhan, Ayoub [1 ]
Hussein, Fairouz [1 ]
El-Salhi, Subhieh Moh'd Faraj S. [1 ]
Affiliations
[1] Hashemite Univ, Fac Prince Al Hussein Bin Abdallah II Informat Te, Dept Comp Informat Syst, POB 330127, Zarqa 13133, Jordan
Keywords
preprocessing; classification algorithms; normalization; missing values; classification performance; data cleaning;
DOI
10.3390/data6020011
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The performance of any prediction model depends on several factors, and one of the most significant is the choice of preprocessing techniques. In other words, preprocessing is essential to building an effective and efficient classification model. This paper investigates the impact of the most widely used preprocessing techniques for numerical features on the performance of classification algorithms. The effect of combining various normalization techniques with missing-value handling strategies is assessed on eighteen benchmark datasets, using two well-known classification algorithms, different performance evaluation metrics, and statistical significance tests. According to the reported experimental results, the impact of the adopted preprocessing techniques varies from one classification algorithm to another. In addition, a statistically significant difference between the considered data preprocessing techniques is demonstrated.
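The combination of missing-value handling and normalization described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: the abstract does not name the specific techniques, so mean/median imputation and min-max/z-score scaling are assumptions, as are the helper names `impute`, `normalize`, and `preprocessing_grid` and the toy feature column. The classifiers, metrics, and significance tests are left out.

```python
from itertools import product
from statistics import mean, median, pstdev


def impute(column, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def normalize(column, method):
    """Rescale a fully observed numeric column."""
    if method == "min-max":
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column has no spread; map it to zeros.
        return [(v - lo) / span if span else 0.0 for v in column]
    if method == "z-score":
        mu, sigma = mean(column), pstdev(column)
        return [(v - mu) / sigma if sigma else 0.0 for v in column]
    raise ValueError(f"unknown method: {method}")


def preprocessing_grid(column,
                       strategies=("mean", "median"),
                       methods=("min-max", "z-score")):
    """Apply every (imputation, normalization) pair to one feature column."""
    return {
        (s, m): normalize(impute(column, s), m)
        for s, m in product(strategies, methods)
    }


# Hypothetical toy feature with one missing value (None).
toy_feature = [2.0, None, 4.0, 10.0]
grid = preprocessing_grid(toy_feature)
for combo, values in grid.items():
    print(combo, [round(v, 3) for v in values])
```

Each entry of `grid` is one preprocessed variant of the same feature. In a study of the kind the abstract describes, each variant would be fed to every classifier and the resulting scores compared across datasets.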
Pages: 1 / 23
Page count: 23
Related papers
37 entries in total
  • [1] Acuña E, 2004, ST CLASS DAT ANAL, P639
  • [2] Short-Term Spatio-Temporal Forecasting of Photovoltaic Power Production
    Agoua, Xwegnon Ghislain
    Girard, Robin
    Kariniotakis, George
    [J]. IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, 2018, 9 (02) : 538 - 546
  • [3] Short-term wind speed forecasting by spectral analysis from long-term observations with missing values
    Akcay, Huseyin
    Filik, Tansu
    [J]. APPLIED ENERGY, 2017, 191 : 653 - 662
  • [4] Dealing with Missing Data and Uncertainty in the Context of Data Mining
    Aleryani, Aliya
    Wang, Wenjia
    De La Iglesia, Beatriz
    [J]. HYBRID ARTIFICIAL INTELLIGENT SYSTEMS (HAIS 2018), 2018, 10870 : 289 - 301
  • [5] Impact of preprocessing on medical data classification
    Almuhaideb, Sarab
    Menai, Mohamed El Bachir
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2016, 10 (06) : 1082 - 1102
  • [6] Baitharu T. R., 2013, J EMERG TRENDS ENG A, V4, P311
  • [7] Spatial autocorrelation and entropy for renewable energy forecasting
    Ceci, Michelangelo
    Corizzo, Roberto
    Malerba, Donato
    Rashkovska, Aleksandra
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2019, 33 (03) : 698 - 729
  • [8] Multi-aspect renewable energy forecasting
    Corizzo, Roberto
    Ceci, Michelangelo
    Fanaee-T, Hadi
    Gama, Joao
    [J]. INFORMATION SCIENCES, 2021, 546 : 701 - 722
  • [9] Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data
    Corizzo, Roberto
    Ceci, Michelangelo
    Japkowicz, Nathalie
    [J]. BIG DATA RESEARCH, 2019, 16 : 18 - 35
  • [10] The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing
    Crone, Sven F.
    Lessmann, Stefan
    Stahlbock, Robert
    [J]. EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2006, 173 (03) : 781 - 800