Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering

被引:45
作者
Ding, Steven H. H. [1 ]
Fung, Benjamin C. M. [1 ]
Charland, Philippe [2 ]
机构
[1] McGill Univ, Sch Informat Studies, Montreal, PQ, Canada
[2] Def R&D Canada Valcartier, Mission Crit Cyber Secur Sect, Quebec City, PQ, Canada
来源
KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2016年
关键词
Asseffibly clone search; Information retrieval; Mining software repositories;
D O I
10.1145/2939672.2939719
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringements when the source code is unavailable. It is also a common practice to discover exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of tins process, since it call identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study the concerns and challenges that make existing assembly code clone approaches not practically applicable from the perspective of data mining. We propose a. new variant of LSH scheme and incorporate it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kom1n0. It is the first clone search engine that can efficiently identify the given query assembly function's subgraph clones from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available.(1) Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volume of assembly code.
引用
收藏
页码:461 / 470
页数:10
相关论文
共 36 条
[1]  
AHO AV, 1976, J ACM, V23, P1, DOI 10.1145/321921.321922
[2]   Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions [J].
Andoni, Alexandr ;
Indyk, Piotr .
COMMUNICATIONS OF THE ACM, 2008, 51 (01) :117-122
[3]  
[Anonymous], 2014, Hashing for similarity search: A survey
[4]  
[Anonymous], 2006, PATTERN RECOGNITION
[5]  
[Anonymous], 1999, P VLDB
[6]  
[Anonymous], 2008, INTRO INFORM RETRIEV, DOI DOI 10.1017/CBO9780511809071
[7]  
Bawa M., 2005, WWW 05
[8]  
Charikar M., 2002, P 34 ANN ACM STOC 02
[9]  
Charland P., 2012, P NATO RTO S INF ASS
[10]   On the Set Multicover Problem in Geometric Settings [J].
Chekuri, Chandra ;
Clarkson, Kenneth L. ;
Har-Peled, Sariel .
ACM TRANSACTIONS ON ALGORITHMS, 2012, 9 (01)