HAC-T and Fast Search for Similarity in Security

Cited by: 7
Authors
Oliver, Jonathan [1]
Ali, Muqeet [2]
Hagen, Josiah [2]
Affiliations
[1] TrendMicro Res, North Sydney, NSW, Australia
[2] TrendMicro Res, Irving, TX USA
Source
2020 INTERNATIONAL CONFERENCE ON OMNI-LAYER INTELLIGENT SYSTEMS (IEEE COINS 2020) | 2020
Keywords
Clustering; Hierarchical Agglomerative Clustering (HAC); Approximate Nearest Neighbour; Fuzzy Hashing; Trend Locality Sensitive Hashing (TLSH);
DOI
10.1109/coins49042.2020.9191381
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Similarity digests have gained popularity in many security applications, such as blacklisting/whitelisting and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware and is more resistant to evasion than other similarity digests such as ssdeep and sdhash. Searching and clustering are fundamental tools that help security analysts and security operations center (SOC) operators hunt and analyze malware. Current approaches to clustering malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques for fast search and clustering of TLSH hash digests, which can help analysts inspect large amounts of malware/goodware. Our approach builds on fast nearest neighbor search techniques to construct a tree-based index that supports fast search over TLSH hash digests. The tree-based index is used in our threshold-based Hierarchical Agglomerative Clustering (HAC-T) algorithm, which clusters digests in a scalable manner, in O(n log n) time on average. We performed an empirical evaluation comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable while still producing good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. Using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee), we obtained high purity scores ranging from 0.97 to 0.98, which demonstrates the effectiveness of the proposed method.
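The abstract reports cluster quality as purity but does not define the measure. As a minimal sketch, assuming the standard definition (each cluster is credited with its most frequent label, and the credited counts are summed and divided by the number of samples), purity could be computed as follows. The function name `purity` and the toy data are hypothetical, and how the paper combines labels from the five vendors is not stated here, so a single label per sample is assumed.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Standard cluster purity: fraction of samples whose label matches
    the majority label of the cluster they were assigned to.

    Assumption: one ground-truth label per sample (hypothetical setup,
    not necessarily the labeling scheme used in the paper).
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    # Count how often each label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        label_counts[cluster_id][label] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(labels)


# Toy data: three clusters over six samples (illustrative only).
example_clusters = [0, 0, 0, 1, 1, 2]
example_labels = ["a", "a", "b", "b", "b", "c"]
example_purity = purity(example_clusters, example_labels)
```

Under this definition, a score in the reported range of 0.97 to 0.98 would mean that roughly 97 to 98 percent of samples share the majority label of their cluster. Purity rises trivially as clusters get smaller, so it is usually read alongside the number of clusters.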
Pages: 265-271
Page count: 7