Machine learning techniques for imbalanced multiclass malware classification through adaptive feature selection

Cited by: 1
Authors
Panda, Binayak [1]
Bisoyi, Sudhanshu Shekhar [2]
Panigrahy, Sidhanta [3]
Mohanty, Prithviraj [2]
Affiliations
[1] Siksha O Anusandhan Deemed Univ, Inst Tech Educ & Res, Dept Comp Sci & Engn, Bhubaneswar, Odisha, India
[2] Siksha O Anusandhan Deemed Univ, Inst Tech Educ & Res, Dept Comp Sci & Informat Technol, Bhubaneswar, Odisha, India
[3] Univ Calif Berkeley, Haas Sch Business, Berkeley, CA USA
Keywords
Greedy feature selection; TF-IDF; Skip-gram; Machine learning; API sequence; Multiclass malware classification
DOI
10.7717/peerj-cs.2752
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Detecting polymorphic or metamorphic variants of known malware is an ever-growing challenge, as is detecting new malware. As the number of malware variants proliferates, artificial intelligence techniques are preferred over conventional signature-based malware detection. This article proposes an Adaptive Multiclass Malware Classification (AMMC) framework that trains base machine learning models to detect malware using fewer computational resources. It also proposes a novel adaptive feature selection (AFS) technique that applies a greedy strategy to term frequency-inverse document frequency (TF-IDF) feature weights, in order to select influential features and improve performance metrics in imbalanced multiclass malware classification problems. To assess the efficacy of AMMC with AFS, three open imbalanced multiclass malware datasets built on Windows API sequence features were used: VirusShare with eight classes, VirusSample with six classes, and MAL-API-2019 with eight classes. Experimental results demonstrate the effectiveness of AMMC with AFS, which achieves state-of-the-art performance on VirusShare, VirusSample, and MAL-API-2019, with macro F1-scores of 0.92, 0.94, and 0.84 and macro area under the curve (AUC) values of 0.99, 0.99, and 0.98, respectively.
Pages: 39
Related papers
44 in total
[1]   Analyzing the performance of long short-term memory architectures for malware detection models [J].
Avci, Cigdem ;
Tekinerdogan, Bedir ;
Catal, Cagatay .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (06) :1
[2]   A system call-based android malware detection approach with homogeneous & heterogeneous ensemble machine learning [J].
Bhat, Parnika ;
Behal, Sunny ;
Dutta, Kamlesh .
COMPUTERS & SECURITY, 2023, 130
[3]  
Cannarile A., 2022, IT C CYB JUN 20 23
[4]   Deep learning based Sequential model for malware analysis using Windows exe API Calls [J].
Catak, Ferhat Ozgur ;
Yaz, Ahmet Faruk ;
Elezaj, Ogerta ;
Ahmed, Javed .
PEERJ COMPUTER SCIENCE, 2020,
[5]  
Cohen F., 1987, Computers & Security, V6, P22, DOI 10.1016/0167-4048(87)90122-2
[6]   A comparison of static, dynamic, and hybrid analysis for malware detection [J].
Damodaran A. ;
Troia F.D. ;
Visaggio C.A. ;
Austin T.H. ;
Stamp M. .
Journal of Computer Virology and Hacking Techniques, 2017, 13 (01) :1-12
[7]   An ensemble of pre-trained transformer models for imbalanced multiclass malware classification [J].
Demirkiran, Ferhat ;
Cayir, Aykut ;
Unal, Gur ;
Dag, Hasan .
COMPUTERS & SECURITY, 2022, 121
[8]   A fast malware detection algorithm based on objective-oriented association mining [J].
Ding, Yuxin ;
Yuan, Xuebing ;
Tang, Ke ;
Xiao, Xiao ;
Zhang, Yibin .
COMPUTERS & SECURITY, 2013, 39 :315-324
[9]  
Duzgun B., 2021, arXiv, DOI 10.48550/ARXIV.2111.15205
[10]   Malware Visualization for Fine-Grained Classification [J].
Fu, Jianwen ;
Xue, Jingfeng ;
Wang, Yong ;
Liu, Zhenyan ;
Shan, Chun .
IEEE ACCESS, 2018, 6 :14510-14523