Unsupervised Domain Adaptation for Static Malware Detection based on Gradient Boosting Trees

Cited by: 2
Authors
Qi, Panpan [1]
Wang, Wei [1]
Zhu, Lei [2]
Ng, See Kiong [3]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore, Singapore
[2] NUS Grad Sch, Integrat Sci & Engn Programme, Singapore, Singapore
[3] Natl Univ Singapore, Inst Data Sci, Singapore, Singapore
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021 | 2021
Funding
National Research Foundation, Singapore
Keywords
unsupervised domain adaptation; malware detection; gradient boosting decision tree
DOI
10.1145/3459637.3482400
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Static malware detection is important for protection against malware because it allows malicious files to be detected prior to execution. It is also especially suitable for machine learning-based approaches. Recently, gradient boosting decision tree (GBDT) models, e.g., LightGBM (a popular implementation of GBDT), have shown outstanding performance for malware detection. However, as malware programs are known to evolve rapidly, malware classification models trained on the (source) training data often fail to generalize to the target domain, i.e., the deployed environment. To handle the underlying data distribution drifts, unsupervised domain adaptation techniques have been proposed for machine learning models, including deep learning models. However, unsupervised domain adaptation for GBDT has remained challenging. In this paper, we adapt the adversarial learning framework for unsupervised domain adaptation to enable GBDT to learn domain-invariant features and alleviate performance degradation in the target domain. In addition, to fully exploit the unlabelled target data, we merge them into the training dataset after pseudo-labelling. We propose a new weighting scheme integrated into GBDT for sampling instances in each boosting round to reduce the negative impact of wrongly labelled target instances. Experiments on two large malware datasets demonstrate the superiority of our proposed method.
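The abstract describes merging pseudo-labelled target data into the training set and down-weighting instances that may be wrongly labelled. The sketch below illustrates that general idea only. It is not the paper's weighting scheme, which the abstract does not specify. The function names `pseudo_label` and `merge_training_set`, the 0.5 threshold, and the confidence weight |2p - 1| are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def pseudo_label(probs: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Assign a pseudo-label and a confidence weight to each target instance.

    probs holds predicted malware probabilities from a source-trained model.
    The weight |2p - 1| (an illustrative choice) is 0 for p = 0.5 and 1 for
    p = 0 or p = 1, so uncertain pseudo-labels contribute less.
    """
    labelled = []
    for p in probs:
        label = 1 if p >= threshold else 0
        weight = abs(2.0 * p - 1.0)
        labelled.append((label, weight))
    return labelled


def merge_training_set(
    source_labels: Sequence[int], target_probs: Sequence[float]
) -> Tuple[List[int], List[float]]:
    """Combine true source labels (weight 1.0) with weighted pseudo-labels."""
    labels = list(source_labels)
    weights = [1.0] * len(source_labels)
    for label, weight in pseudo_label(target_probs):
        labels.append(label)
        weights.append(weight)
    return labels, weights


# Two labelled source instances, three unlabelled target instances.
labels, weights = merge_training_set([0, 1], [0.9, 0.5, 0.1])
```

The resulting weights could then be supplied as per-instance weights to a GBDT trainer, for example through the `weight` argument of LightGBM's `Dataset`. The paper instead integrates its weighting into the per-round instance sampling.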
Pages: 1457-1466
Page count: 10