AVCLASS: A Tool for Massive Malware Labeling

Cited by: 266
Authors
Sebastian, Marcos [1]
Rivera, Richard [1,2]
Kotzias, Platon [1,2]
Caballero, Juan [1]
Affiliations
[1] IMDEA Software Inst, Madrid, Spain
[2] Univ Politecn Madrid, Madrid, Spain
Source
RESEARCH IN ATTACKS, INTRUSIONS, AND DEFENSES, RAID 2016 | 2016 / Vol. 9854
Keywords
Malware labeling; AV labels; Classification; Clustering
DOI
10.1007/978-3-319-45719-2_11
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Labeling a malicious executable as a variant of a known family is important for security applications such as triage, lineage, and for building reference datasets in turn used for evaluating malware clustering and training malware classification approaches. Oftentimes, such labeling is based on labels output by antivirus engines. While AV labels are well-known to be inconsistent, there is often no other information available for labeling, thus security analysts keep relying on them. However, current approaches for extracting family information from AV labels are manual and inaccurate. In this work, we describe AVCLASS, an automatic labeling tool that given the AV labels for a, potentially massive, number of samples outputs the most likely family names for each sample. AVCLASS implements novel automatic techniques to address 3 key challenges: normalization, removal of generic tokens, and alias detection. We have evaluated AVCLASS on 10 datasets comprising 8.9 M samples, larger than any dataset used by malware clustering and classification works. AVCLASS leverages labels from any AV engine, e.g., all 99 AV engines seen in VirusTotal, the largest engine set in the literature. AVCLASS's clustering achieves F1 measures up to 93.9 on labeled datasets and clusters are labeled with fine-grained family names commonly used by the AV vendors. We release AVCLASS to the community.
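The labeling idea summarized in the abstract can be illustrated with a minimal sketch. This is not the AVCLASS implementation: the generic-token list, the alias map, the minimum token length, the engine names, and the function names below are all hypothetical and chosen by hand for illustration, whereas the abstract states that AVCLASS derives generic tokens and aliases automatically. The sketch normalizes each AV label into tokens, drops generic tokens, maps aliases to one name, and picks the token reported by the most engines.

```python
import re
from collections import Counter

# Hypothetical, hand-written lists for illustration only.
GENERIC_TOKENS = {"trojan", "win32", "generic", "malware", "variant", "agent"}
ALIASES = {"zeus": "zbot"}  # assumed alias pair, mapped to one family name


def family_tokens(label):
    """Normalize one AV label into candidate family tokens."""
    tokens = []
    for tok in re.split(r"[^a-z0-9]+", label.lower()):
        # Drop short tokens, pure numbers, and generic tokens.
        if len(tok) < 4 or tok.isdigit() or tok in GENERIC_TOKENS:
            continue
        tokens.append(ALIASES.get(tok, tok))
    return tokens


def most_likely_family(av_labels):
    """Return the token named by the most engines, or None if none remain."""
    votes = Counter()
    for label in av_labels.values():
        # Each engine votes at most once per token.
        for tok in set(family_tokens(label)):
            votes[tok] += 1
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically.
    return min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]


example = {
    "EngineA": "Trojan.Win32.Zbot.abcd",
    "EngineB": "Win32/Zeus.Gen",
    "EngineC": "Trojan-Spy.Zbot!7",
    "EngineD": "Generic.Malware.12345",
}
print(most_likely_family(example))
```

In this example three of the four hypothetical engines agree on the same family once the alias is resolved, while the fourth label contains only generic tokens and contributes no vote.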
Pages: 230-253
Page count: 24