Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data

Cited by: 0

Authors:

Morillo, Paulina [1,2]

Bahamonde, Diego [1]

Tapia, Wilian [1]

Affiliations:

[1] Univ Politecn Salesiana, Comp Sci Engn, Quito, Ecuador

[2] Univ Politecn Salesiana, IDEIAGEOCA Res Grp, Quito, Ecuador

Source:

INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 1, INTELLISYS 2023 | 2024 / Vol. 822

Keywords:

Binary Classification; Re-Sample; Oversampling; Undersampling; Hybrid; Balance Accuracy; G-Mean; AUC; Confusion Matrix;

DOI:

10.1007/978-3-031-47721-8_33

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Detecting malware is crucial to avoid severe damage to a computer system. However, training Machine Learning algorithms for this task can be complicated because the data are often imbalanced. One of the challenges of binary classification is therefore learning to distinguish clearly between two classes when one class has many more instances than the other. To decrease bias and handle imbalance, some techniques increase the number of minority-class cases or reduce the number of majority-class cases. This paper analyzes the performance of three cost-sensitive classifiers, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), trained with an imbalanced malware detection dataset and four artificial datasets built using the Near Miss, SMOTE, SMOTEENN, and SMOTE-Tomek resampling techniques. The results show that Near Miss achieves a proper balance between the classes, so that the algorithms increase their overall performance, reaching balanced accuracies greater than 95%. The other techniques, in contrast, slightly increase the ability of the classifiers to identify objects of the minority class. Meanwhile, Random Forest achieved balanced and high performance. In addition, training and testing times with oversampling or hybrid techniques are far longer than with undersampling, since the latter reduces the number of instances processed by the models.
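The abstract reports balanced accuracy, and the keywords list G-Mean and the confusion matrix. The sketch below shows how these metrics are commonly computed from confusion-matrix counts and why plain accuracy can mislead on imbalanced data. It is not taken from the paper: the function name `imbalance_metrics`, the choice of the minority class as the positive class, and the sample counts are all illustrative assumptions.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute common imbalance-aware metrics from confusion-matrix counts.

    Assumption: the positive class is the minority class; the counts
    passed in below are hypothetical and not taken from the paper.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical test set: 50 positive and 950 negative samples.
# A classifier that always predicts "negative" misses every positive.
trivial = imbalance_metrics(tp=0, fn=50, tn=950, fp=0)
print(trivial)
# accuracy is 0.95, yet balanced accuracy is 0.5 and G-mean is 0.0
```

Balanced accuracy averages the per-class recalls and G-mean takes their geometric mean, so both penalize a model that ignores the minority class, whereas plain accuracy rewards it.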

Pages: 496-507

Page count: 12
