Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems

Cited by: 31
Authors
Euh, Seoungyul [1]
Lee, Hyunjong [1]
Kim, Donghoon [2]
Hwang, Doosung [3]
Affiliations
[1] KSign, Secur Technol Inst, Seoul 06231, South Korea
[2] Arkansas State Univ, Dept Comp Sci, Jonesboro, AR 72467 USA
[3] Dankook Univ, Dept Software Sci, Yongin 16890, South Korea
Keywords
Malware; Feature extraction; Entropy; Training; Forestry; Machine learning algorithms; Machine learning; Malware detection; feature extraction; tree-based ensemble; AUC-PRC; CLASSIFICATION; BEHAVIOR
DOI
10.1109/ACCESS.2020.2986014
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are applied to select relevant features from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features reduce the original feature dimensionality by a factor of several tens to hundreds and decrease the training time of ensemble models without degrading the malware detection rate relative to the full set of malware features. In accuracy and AUC-PRC evaluation, XGBoost ranks highest.
Pages: 76796-76808
Page count: 13