Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity Hashes in Binary Analysis

被引:25
作者
Pagani, Fabio [1 ]
Dell'Amico, Matteo [2 ]
Balzarotti, Davide [1 ]
机构
[1] EURECOM, Biot, France
[2] Symantec Res Labs, Grenoble, France
来源
PROCEEDINGS OF THE EIGHTH ACM CONFERENCE ON DATA AND APPLICATION SECURITY AND PRIVACY (CODASPY'18) | 2018年
关键词
binary analysis; fuzzy hash; malware; approximate matching; ALGORITHMS;
D O I
10.1145/3176258.3176306
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Fuzzy hashing algorithms provide a convenient way of summarizing in a compact form the content of files, and of looking for similarities between them. Because of this, they are widely used in the security and forensics communities to look for similarities between binary program files; one version of them, ssdeep, is the de facto standard to share information about known malware. Fuzzy hashes are quite pervasive, but no study so far answers conclusively the question of which (if any) fuzzy hashing algorithms are suited to detect similarities between programs, where we consider as similar those programs that have code or libraries in common. We measure how four popular algorithms perform in different scenarios: when they are used to correlate statically-compiled files with the libraries they use, when compiled with different flags or different compilers, and when applied to programs that share a large part of their source code. Perhaps more importantly, we provide interpretations that explain the reasons why results vary, sometimes widely, among apparently very similar use cases. We find that the low-level details of the compilation process, together with the technicalities of the hashing algorithms, can explain surprising results such as similarities dropping to zero with the change of a single assembly instruction. More in general, we see that ssdeep, the de facto standard for this type of analysis, performs definitely worse than alternative algorithms; we also find that the best choice of algorithm to use varies depending on the particularities of the use case scenario.
引用
收藏
页码:354 / 365
页数:12
相关论文
共 31 条
[1]  
[Anonymous], 2014, NIST SPECIAL PUBLICA
[2]  
[Anonymous], 2012, Results of SEI Line-Funded Exploratory New Starts Projects
[3]  
[Anonymous], 1981, FINGERPRINTING RANDO
[4]   Mining Malware To Detect Variants [J].
Azab, Ahmad ;
Layton, Robert ;
Alazab, Mamoun ;
Oliver, Jonathan .
2014 5TH CYBERCRIME AND TRUSTWORTHY COMPUTING CONFERENCE CTC, 2014, :44-53
[5]   FRASH: A framework to test algorithms of similarity hashing [J].
Breitinger, Frank ;
Stivaktakis, Georgios ;
Baier, Harald .
DIGITAL INVESTIGATION, 2013, 10 :S50-S58
[6]  
Breitinger F, 2013, L N INST COMP SCI SO, V114, P167
[7]   Automated evaluation of approximate matching algorithms on real data [J].
Breitinger, Frank ;
Roussev, Vassil .
DIGITAL INVESTIGATION, 2014, 11 :S10-S17
[8]  
Chaki Sagar., 2011, Proceedings of the 17th ACM SIGKDD in-ternational conference on Knowledge discovery and data mining, P15
[9]  
Costin A, 2014, USENIX SEC S, P95
[10]  
Damiani E, 2004, PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, P559