Using Multi-features and Ensemble Learning Method for Imbalanced Malware Classification

Cited by: 0
Authors
Zhang, Yunan [1]
Huang, Qingjia [1]
Ma, Xinjian [1]
Yang, Zeming [1]
Jiang, Jianguo [2]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing Key Lab Network Secur Technol, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
Source
2016 IEEE TRUSTCOM/BIGDATASE/ISPA | 2016
Keywords
DOI
10.1109/TrustCom.2016.161
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
The ever-growing malware threats in cyberspace call for techniques that are more effective than the widely deployed signature-based detection systems. To counter large volumes of malware variants, machine learning techniques have been applied to automated malware classification. Although these efforts have achieved some success, their accuracy and efficiency remain inadequate, especially for multi-class classification with imbalanced training data. Against this backdrop, the goal of this paper is to build a malware classification system that improves on this situation. Our system is based on multiple categories of static features and an ensemble learning method. Compared with some traditional systems, it has the following advantages. Firstly, with multiple categories of features, our system can classify malware into the corresponding families effectively and efficiently while, to a certain extent, avoiding the influence of evasion. Our method does not need any unpacking process and extracts features directly from the bytes file and the disassembled asm file. Secondly, the system employs two efficient ensemble learning models, namely XGBoost and ExtraTreeClassifier, and combines them with a stacking method to construct the final classifier. Finally, we evaluated our system on the dataset provided by Microsoft and hosted on Kaggle for the malware classification competition. The results show that our method classifies malware into families effectively and efficiently, with an accuracy of 0.9972 on the training set and a logloss of 0.00395 on the testing set. Our work not only offers insights into how to use multiple features for classification, but also sheds light on how to develop scalable techniques for automated malware classification in practice.
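The logloss figure quoted in the abstract can be read against the usual multi-class logarithmic loss, logloss = -(1/N) Σ_i Σ_j y_ij log(p_ij), where y_ij is 1 if sample i belongs to family j and 0 otherwise. The following is a minimal sketch of that definition only, not the authors' evaluation code. It assumes the standard definition applies here, and the function name, the clipping constant `eps`, and the three-sample predictions are hypothetical illustrations.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices, one per sample.
    probs:  list of per-sample probability rows, one value per class.
    eps:    clipping constant (an assumption) that keeps log() finite.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip each probability away from 0 and 1, then renormalise the row.
        clipped = [min(max(p, eps), 1.0 - eps) for p in row]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)


# Hypothetical predictions for three samples over three families.
y_true = [0, 2, 1]
probs = [
    [0.98, 0.01, 0.01],
    [0.05, 0.05, 0.90],
    [0.10, 0.80, 0.10],
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

Because the loss averages the negative log-probability given to the correct family, a value close to zero means the classifier is both correct and highly confident on nearly every sample, which is a stricter requirement than high accuracy alone.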
Pages: 965-973
Page count: 9