When to Use Standardization and Normalization: Empirical Evidence From Machine Learning Models and XAI

Cited by: 19

Authors:

Sujon, Khaled Mahmud [1]

Hassan, Rohayanti Binti [2]

Towshi, Zeba Tusnia [3]

Othman, Manal A. [4]

Samad, Md Abdus [5]

Choi, Kwonhue [5]

Affiliations:

[1] Univ Teknol Malaysia UTM, Fac Comp, Dept Software Engn, Johor Baharu 81310, Johor, Malaysia

[2] Univ Teknol Malaysia UTM, Fac Comp, Johor Baharu 81310, Johor, Malaysia

[3] Independent Univ, Dept Comp Sci & Engn, Dhaka 1229, Bangladesh

[4] Princess Nourah Bint Abdulrahman Univ, Coll Med, Med Educ Dept, Biomed Informat, Riyadh 11671, Saudi Arabia

[5] Yeungnam Univ, Dept Informat & Commun Engn, Gyongsan 38541, South Korea

Source:

IEEE ACCESS | 2024, Vol. 12

Keywords:

Standardization; normalization; feature scaling; data preprocessing; machine learning; explainable AI (XAI)

DOI:

10.1109/ACCESS.2024.3462434

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Optimizing machine learning (ML) model performance relies heavily on appropriate data preprocessing techniques. Despite the widespread use of standardization and normalization, empirical comparisons across different models, dataset sizes, and domains remain sparse. This study bridges this gap by evaluating five machine learning algorithms, namely Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost), on datasets of varying sizes from the business, health, and agriculture domains. The models were assessed without scaling, with standardized data, and with normalized data. The comparative analysis reveals that while standardization consistently improves the performance of linear models like SVM and LR for large and medium datasets, normalization enhances the performance of linear models for small datasets. Moreover, this study employs SHapley Additive exPlanations (SHAP) summary plots to examine how each feature contributes to the model's performance and interpretability on unscaled and scaled datasets. This study provides practical guidelines for selecting appropriate scaling techniques based on the characteristics of datasets and compatibility with various algorithms. Finally, this investigation lays a foundation for data preprocessing and feature engineering across diverse models and domains, offering actionable insights for practitioners.
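As a rough illustration of the two scaling techniques the abstract compares, the following sketch applies both to a single feature column. It assumes the common textbook definitions (z-score standardization and min-max normalization); this record does not give the paper's exact formulas, and the function names and sample values here are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score scaling (assumed definition): (x - mean) / population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Min-max scaling (assumed definition): (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has zero range; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column, not taken from the paper's datasets.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(feature)  # centered on 0 with unit variance
m = normalize(feature)    # rescaled into the interval [0, 1]
```

In a real experiment the scaling statistics (mean, standard deviation, minimum, maximum) would typically be computed on the training split only and then reused on the test split, so that no test information leaks into preprocessing.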


Pages: 135300-135314

Page count: 15
