Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

Cited by: 253
Authors
Ding, Steven H. H. [1]
Fung, Benjamin C. M. [1]
Charland, Philippe [2]
Affiliations
[1] McGill Univ, Sch Informat Studies, Data Min & Secur Lab, Montreal, PQ, Canada
[2] Def R&D Canada Valcartier, Mission Crit Cyber Secur Sect, Quebec City, PQ, Canada
Source
2019 IEEE Symposium on Security and Privacy (SP 2019) | 2019
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
DOI
10.1109/SP.2019.00003
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts. However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different. A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have developed an assembly code representation learning model Asm2Vec. It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations.
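As a rough illustration of the clone search step the abstract describes, the sketch below ranks repository functions by similarity to a query function once each function already has a vector representation. It does not reproduce how Asm2Vec learns those vectors. The use of cosine similarity, the function names, and the three-dimensional vectors are all assumptions made for illustration and are not taken from this record.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search_clones(query_vec, repository, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in repository.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical function embeddings; a real system would learn these from assembly code.
repository = {
    "memcpy_O0": [0.8, 0.2, 0.3],
    "strlen_O2": [0.1, 0.9, 0.2],
    "sha1_round": [-0.5, 0.4, 0.7],
}

# Hypothetical embedding of the same memcpy routine built with different compiler options.
query = [0.9, 0.1, 0.3]

results = search_clones(query, repository)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With vectors that are robust to optimization and obfuscation, the differently compiled variant of a function should still rank near the top of such a list, which is the property the abstract claims for the learned representation.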
Pages: 472-489
Page count: 18