Duplicates in the Drebin Dataset and Reduction in the Accuracy of the Malware Detection Models

Cited by: 2

Authors:

Mishra, Jyotiprakash [1]

Sahay, Sanjay K. [2]

Rathore, Hemant [2]

Kumar, Lokesh [2]

Affiliations:

[1] Kalinga Inst Ind Technol, Sch Comp Engn, Bhubaneswar, India

[2] BITS Pilani, Dept Comp Sci & Informat Syst, KK Birla Goa Campus, Pilani, Goa, India

Source:

2021 26TH IEEE ASIA-PACIFIC CONFERENCE ON COMMUNICATIONS (APCC) | 2021

Keywords:

Android; Deep Neural Network; Fitting Factor; Machine Learning; Malware Detection;

DOI:

10.1109/APCC49754.2021.9609892

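Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract: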

The Android operating system has constantly remained in the limelight and hence attracts the attention of cybercriminals. In response, many researchers have applied machine learning and deep learning techniques to build malware detection models based on the popular Drebin malware dataset. However, a cursory look at a table of Dalvik opcode frequencies leads us to believe that this dataset may contain a massive number of duplicate malicious files. Hence, we used a technique called fitting factor to find the duplicate malicious files in the Drebin dataset on the basis of opcode occurrence. We found that 51.57% of the malicious samples in the dataset have one or more duplicates. Accordingly, we studied the performance of popular detection models with and without duplicates, using all the features and the top 26 features engineered by Information Gain (IG) and Auto-Encoder (AE). The experimental results show that the Random Forest classifier, one of the most popular classical classifiers, declines in accuracy by 4.2%, 5.3%, and 8.8% with all features and with the top 26 features obtained by IG and AE, respectively. To establish the observed facts, we further experimented extensively with Decision Tree, Bagging, Gradient Boost, XGBoost, and Deep Neural Network classifiers. The most significant decline in accuracy (12.2%) was observed for the Deep Neural Network classifier with the features obtained by IG. That is, the previously reported performance of malware detection models based on Drebin data is exaggerated, and consequently it may lead this field of research in a wrong direction.
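
The abstract names the fitting factor but does not define it. The sketch below assumes it is the normalized inner product of two opcode-occurrence vectors (equivalent to cosine similarity), and that two samples count as duplicates when the score reaches 1. Under this assumption, vectors that are exact multiples of each other also score 1. The function names, the threshold, the tolerance, and the toy opcode counts are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Set


def fitting_factor(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized inner product of two opcode-occurrence vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def find_duplicates(samples: Dict[str, Sequence[float]],
                    threshold: float = 1.0,
                    tol: float = 1e-9) -> Set[str]:
    """Return the names of samples that have at least one duplicate."""
    duplicates: Set[str] = set()
    # Pairwise comparison is quadratic; fine for a sketch, slow for a full dataset.
    for name_a, name_b in combinations(samples, 2):
        if fitting_factor(samples[name_a], samples[name_b]) >= threshold - tol:
            duplicates.update((name_a, name_b))
    return duplicates


def duplicate_fraction(samples: Dict[str, Sequence[float]]) -> float:
    """Fraction of samples that have one or more duplicates."""
    if not samples:
        return 0.0
    return len(find_duplicates(samples)) / len(samples)


# Toy opcode-occurrence counts (hypothetical, three opcodes per sample).
toy_samples = {
    "sample_a": [3, 0, 2],
    "sample_b": [3, 0, 2],
    "sample_c": [1, 5, 0],
    "sample_d": [0, 1, 1],
}
```

Calling `duplicate_fraction(toy_samples)` reports the share of samples with at least one duplicate, which is the kind of quantity the abstract gives as 51.57% for the Drebin malware samples.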

Citation

Pages: 161-165

Page count: 5
