Overcoming the lack of labeled data: Training malware detection models using adversarial domain adaptation

Cited by: 7
Authors
Bhardwaj, Sonam [1]
Li, Adrian Shuai [2]
Dave, Mayank [1]
Bertino, Elisa [2]
Affiliations
[1] Natl Inst Technol Kurukshetra, Dept Comp Engn, Kurukshetra, India
[2] Purdue Univ, Dept Comp Sci, W Lafayette, IN USA
Funding
National Science Foundation (USA);
Keywords
Transfer learning; Malware detection; Malware images; CNN; Domain adaptation; Generative adversarial networks;
DOI
10.1016/j.cose.2024.103769
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Many current malware detection methods are based on supervised learning techniques, which have two main limitations. First, these techniques require a large amount of labeled training data, which is often difficult to obtain. Second, they are not very effective when the domain distribution of new malware differs from that of known malware. To address these issues, we propose MD-ADA, a malware detection framework that leverages adversarial domain adaptation (DA). DA allows one to adapt a training malware dataset available in one domain, referred to as the source, for training a classifier in another domain, referred to as the target. DA, typically used when the target has limited training malware data available, maps the source and target datasets into a common latent space. As we use an image representation for malware binaries, MD-ADA uses a convolutional neural network (CNN) that provides a lossless image embedding for the source and target datasets. MD-ADA also employs a generative adversarial network (GAN) for malware classification that is suitable for scenarios with few labeled target data, whether the distribution of the features is similar (homogeneous) or different (heterogeneous). We have carried out several experiments to assess the performance of MD-ADA. The experiments show that MD-ADA outperforms the fine-tuning approach, with an accuracy of 99.29% on the BODMAS dataset and 89.3% on the Malevis dataset for homogeneous feature distributions, and 90.12% on the CICMalMem2022 dataset (target) and 83.23% on the Microsoft Kaggle dataset (target) for heterogeneous feature distributions. The observed F1-scores of 99.13% and 87.5% for homogeneous feature distributions and 91.27% and 81.7% for heterogeneous distributions indicate that MD-ADA performs satisfactorily for both kinds of distribution when the target has very few labeled data.
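The abstract states that malware binaries are represented as images but does not say how the conversion is done. As an illustration only, the sketch below shows one commonly used scheme, assumed here rather than taken from the paper: each byte becomes one grayscale pixel (0 to 255), rows have a fixed width, and the last row is zero-padded. The function names, the default width, and the padding value are all hypothetical.

```python
from typing import List


def bytes_to_grayscale_rows(data: bytes, width: int = 16, pad: int = 0) -> List[List[int]]:
    """Lay out raw bytes row by row as a grayscale image (one byte = one pixel).

    Assumption: fixed row width with padding of the final row; the paper's
    actual preprocessing may differ.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each element is an int in 0..255
    remainder = len(pixels) % width
    if remainder:
        # Pad the final row so that every row has exactly `width` pixels.
        pixels.extend([pad] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def rows_to_bytes(rows: List[List[int]], original_length: int) -> bytes:
    """Invert the layout; the original length is needed to drop the padding."""
    flat = bytes(value for row in rows for value in row)
    return flat[:original_length]


# Toy input standing in for the contents of a binary file.
sample = bytes(range(10))
image = bytes_to_grayscale_rows(sample, width=4)
restored = rows_to_bytes(image, len(sample))
```

In this toy layout the mapping is reversible as long as the original length is stored alongside the image, since the padding can then be stripped again.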
Pages: 11