Empirical assessment of machine learning-based malware detectors for Android: Measuring the gap between in-the-lab and in-the-wild validation scenarios

Cited by: 77
Authors
Allix, Kevin [1]
Bissyande, Tegawende F. [1]
Jerome, Quentin [1]
Klein, Jacques [1]
State, Radu [1]
Le Traon, Yves [1]
Affiliation
[1] Univ Luxembourg, Interdisciplinary Ctr Secur Reliabil & Trust, 4 Rue Alphonse Weicker, L-2721 Luxembourg, Luxembourg
Keywords
Machine learning; Ten-fold; Malware; Android; Classification; Executables
DOI
10.1007/s10664-014-9352-6
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
To address the problem of detecting malware in large sets of applications, researchers have recently started to investigate machine-learning techniques. Several promising results have been reported in the literature, with many approaches assessed in what we call in-the-lab validation scenarios. This paper revisits the purpose of malware detection to discuss whether such in-the-lab validation scenarios provide reliable indications of the performance of malware detectors in real-world settings, i.e., in the wild. To this end, we have devised several machine learning classifiers that rely on a set of features built from applications' control-flow graphs (CFGs). We use a sizeable dataset of over 50,000 Android applications collected from the sources from which state-of-the-art approaches have selected their data. We show that, in the lab, our approach outperforms existing machine learning-based approaches. However, this high performance does not translate into high performance in the wild. The performance gap we observed, with F-measures dropping from over 0.9 in the lab to below 0.1 in the wild, raises one important question: how do state-of-the-art approaches perform in the wild?
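The F-measure quoted in the abstract can be illustrated with a minimal sketch, assuming the balanced (F1) variant, i.e. the harmonic mean of precision and recall. The function name `f_measure` and all counts below are hypothetical, chosen only to show how the metric behaves at the two ends of the reported range; they are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives for the "malware" class.
    """
    if tp == 0:
        # No malware correctly detected: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
lab_score = f_measure(tp=95, fp=5, fn=5)     # few errors of either kind
wild_score = f_measure(tp=5, fp=90, fn=95)   # many false alarms and misses

print(f"in the lab:  {lab_score:.2f}")
print(f"in the wild: {wild_score:.2f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a detector needs both high precision and high recall to score above 0.9, while a collapse in either one is enough to pull the score toward 0.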
Pages: 183-211
Page count: 29