When Malware is Packin' Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features

Cited by: 81
Authors
Aghakhani, Hojjat [1]
Gritti, Fabio [1]
Mecca, Francesco [2]
Lindorfer, Martina [3]
Ortolani, Stefano [4]
Balzarotti, Davide [5]
Vigna, Giovanni [1]
Kruegel, Christopher [1]
Affiliations
[1] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
[2] Univ Torino, Turin, Italy
[3] TU Wien, Vienna, Austria
[4] Lastline Inc, Redwood City, CA USA
[5] Eurecom, Biot, France
Source
27TH ANNUAL NETWORK AND DISTRIBUTED SYSTEM SECURITY SYMPOSIUM (NDSS 2020) | 2020
Funding
European Research Council; US National Science Foundation;
DOI
10.14722/ndss.2020.24310
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Machine learning techniques are widely used in addition to signatures and heuristics to increase the detection rate of anti-malware software, as they automate the creation of detection models, making it possible to handle an ever-increasing number of new malware samples. In order to foil the analysis of anti-malware systems and evade detection, malware uses packing and other forms of obfuscation. However, few realize that benign applications use packing and obfuscation as well, to protect intellectual property and prevent license abuse. In this paper, we study how machine learning based on static analysis features operates on packed samples. Malware researchers have often assumed that packing would prevent machine learning techniques from building effective classifiers. However, both industry and academia have published results that show that machine-learning-based classifiers can achieve good detection rates, leading many experts to think that classifiers are simply detecting the fact that a sample is packed, as packing is more prevalent in malicious samples. We show that, different from what is commonly assumed, packers do preserve some information when packing programs that is "useful" for malware classification. However, this information does not necessarily capture the sample's behavior. We demonstrate that the signals extracted from packed executables are not rich enough for machine-learning-based models to (1) generalize their knowledge to operate on unseen packers, and (2) be robust against adversarial examples. We also show that a naive application of machine learning techniques results in a substantial number of false positives, which, in turn, might have resulted in incorrect labeling of ground-truth data used in past work.
Pages: 20