Duplicates in the Drebin Dataset and Reduction in the Accuracy of the Malware Detection Models

Cited by: 2
Authors
Mishra, Jyotiprakash [1]
Sahay, Sanjay K. [2]
Rathore, Hemant [2]
Kumar, Lokesh [2]
Affiliations
[1] Kalinga Inst Ind Technol, Sch Comp Engn, Bhubaneswar, India
[2] BITS Pilani, Dept Comp Sci & Informat Syst, KK Birla Goa Campus, Pilani, Goa, India
Source
2021 26TH IEEE ASIA-PACIFIC CONFERENCE ON COMMUNICATIONS (APCC) | 2021
Keywords
Android; Deep Neural Network; Fitting Factor; Machine Learning; Malware Detection;
DOI
10.1109/APCC49754.2021.9609892
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline classification codes
0808 ; 0809 ;
Abstract
The Android operating system has constantly remained in the limelight and hence attracts the attention of cybercriminals. Recognizing this growing challenge, many researchers have reported strong results by applying machine/deep learning techniques to build malware detection models based on the popular Drebin malware dataset. However, a cursory look at a table of Dalvik opcode frequencies led us to believe that this dataset may contain a massive number of duplicate malicious files. Hence, we used a technique called the fitting factor to find duplicate malicious files in the Drebin dataset on the basis of opcode occurrence. We found that 51.57% of the malicious samples in the dataset have one or more duplicates. Accordingly, we studied the performance of popular detection models with and without duplicates, using all the features, the top 26 features selected by Information Gain (IG), and the top 26 features obtained by Auto-Encoder (AE). The experimental results show that one of the most popular classical classifiers, the Random Forest classifier, declines in accuracy by 4.2%, 5.3% and 8.8% with all features, the top 26 features obtained by IG, and the top 26 features obtained by AE, respectively. To establish the observed facts, we further experimented extensively with Decision Tree, Bagging, Gradient Boost, XG Boost, and Deep Neural Network classifiers. The most significant decline in accuracy (12.2%) was observed in the Deep Neural Network classifier with the features obtained by IG. That is, the previously reported performance of malware detection models based on Drebin data is exaggerated, and consequently it may lead this field of research in a wrong direction.
Pages: 161-165
Page count: 5
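
The abstract describes flagging duplicates by comparing samples on their opcode occurrences with a fitting factor, but this record does not give the formula. The sketch below assumes the fitting factor is the normalized inner product of two opcode-count vectors, with a value of 1 treated as a duplicate. The function names, the threshold and tolerance parameters, and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math


def fitting_factor(a, b):
    """Normalized inner product of two opcode-count vectors.

    ASSUMPTION: this definition is not given in the record; it is the
    usual normalized-overlap form. Returns 0.0 if either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(samples, threshold=1.0, tol=1e-9):
    """Return name pairs whose fitting factor reaches the threshold.

    `samples` maps a (hypothetical) sample name to its opcode-count vector.
    The pairwise loop is quadratic, which is fine for a sketch only.
    """
    names = sorted(samples)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if fitting_factor(samples[first], samples[second]) >= threshold - tol:
                pairs.append((first, second))
    return pairs


def duplicate_fraction(samples, **kwargs):
    """Fraction of samples that have at least one duplicate."""
    flagged = {name for pair in find_duplicates(samples, **kwargs) for name in pair}
    return len(flagged) / len(samples) if samples else 0.0


# Toy opcode-count vectors (made up for illustration).
toy = {"a": [3, 0, 2], "b": [3, 0, 2], "c": [1, 5, 0]}
dups = find_duplicates(toy)
frac = duplicate_fraction(toy)
```

Under this assumed definition, vectors that are exact multiples of each other also score 1, so a stricter duplicate check would additionally compare the raw counts.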