A cost analysis of machine learning using dynamic runtime opcodes for malware detection

Cited by: 15
Authors
Carlin, Domhnall [1]
O'Kane, Philip [1]
Sezer, Sakir [1]
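Affiliations
[1] Queen's University Belfast, Centre for Secure Information Technologies, Belfast, Antrim, Northern Ireland
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords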
Malicious code; Network security; Machine learning; Computer security; Malware;
DOI
10.1016/j.cose.2019.04.018
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The ongoing battle between malware distributors and those seeking to prevent the onslaught of malicious code has, so far, favoured the former. Anti-virus methods are faltering with the rapid evolution and distribution of new malware, with obfuscation and detection evasion techniques exacerbating the issue. Recent research has monitored low-level opcodes to detect malware. Such dynamic analysis reveals the code at runtime, allowing the true behaviour to be examined. While previous research uses machine learning techniques to accurately detect malware using dynamic runtime opcodes, the underpinning datasets have been poorly sampled and inadequate in size. Further, the datasets are always fixed in size and no attempt, to our knowledge, has been made to examine the cost of retraining malware classification models on datasets which grow continually. In the literature, researchers discuss the explosion of malware, yet opcode analyses have used fixed-size datasets, with no regard for how such models will cope with retraining on escalating datasets. The research presented here examines this problem, and makes several novel contributions to the current body of knowledge. First, the performance of 23 machine learning algorithms is investigated with respect to the largest run trace dataset in the literature. Second, following an extensive hyperparameter selection process, the performance of each classifier is compared, on both accuracy and computational costs (CPU time). Lastly, the cost of retraining and testing updatable and non-updatable classifiers, both parallelized and non-parallelized, is examined with simulated escalating datasets. This provides insight into how implemented malware classifiers would perform, given simulated dataset escalation. We find that parallelized RandomForest, using 4 cores, provides the optimal performance, with high accuracy and low training and testing times. © 2019 Elsevier Ltd. All rights reserved.
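The distinction the abstract draws between updatable and non-updatable classifiers on an escalating dataset can be illustrated with a minimal sketch. This is not the authors' experimental setup: the `CentroidClassifier`, the synthetic "opcode frequency" vectors, the batch sizes, and the use of "samples processed" as a stand-in for CPU time are all assumptions made for illustration only.

```python
import random


class CentroidClassifier:
    """Toy nearest-centroid model (hypothetical stand-in for a real classifier)."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.sums = {}              # label -> per-feature running sums
        self.counts = {}            # label -> number of samples seen
        self.samples_processed = 0  # crude proxy for training cost

    def update(self, X, y):
        """Fold new samples into the running sums (incremental training)."""
        for row, label in zip(X, y):
            sums = self.sums.setdefault(label, [0.0] * self.n_features)
            for i, value in enumerate(row):
                sums[i] += value
            self.counts[label] = self.counts.get(label, 0) + 1
            self.samples_processed += 1

    def centroids(self):
        return {
            label: [v / self.counts[label] for v in sums]
            for label, sums in self.sums.items()
        }


def make_batch(rng, size, n_features):
    """Synthetic feature vectors; label 1 is shifted so the classes differ."""
    X, y = [], []
    for _ in range(size):
        label = rng.randint(0, 1)
        X.append([rng.random() + 0.5 * label for _ in range(n_features)])
        y.append(label)
    return X, y


def simulate(n_batches=5, batch_size=200, n_features=8, seed=0):
    """Grow the dataset batch by batch and compare the two retraining strategies."""
    rng = random.Random(seed)
    seen_X, seen_y = [], []
    incremental = CentroidClassifier(n_features)
    full_retrain_cost = 0
    scratch = None
    for _ in range(n_batches):
        X, y = make_batch(rng, batch_size, n_features)
        seen_X += X
        seen_y += y
        # Non-updatable strategy: rebuild the model on everything seen so far.
        scratch = CentroidClassifier(n_features)
        scratch.update(seen_X, seen_y)
        full_retrain_cost += scratch.samples_processed
        # Updatable strategy: process only the newly arrived batch.
        incremental.update(X, y)
    return full_retrain_cost, incremental.samples_processed, scratch, incremental


if __name__ == "__main__":
    full_cost, inc_cost, _, _ = simulate()
    print("samples processed, full retraining:", full_cost)
    print("samples processed, incremental updates:", inc_cost)
```

With the default settings, full retraining processes 200 × (1 + 2 + 3 + 4 + 5) = 3000 samples in total, while incremental updating processes 1000, and both strategies end with the same centroids. In a real comparison, CPU time (for example via `time.process_time()`) would replace the sample count, and the gap would depend on the classifier and on any parallelization.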
Pages: 138-155
Page count: 18