HIMALIA: Recovering Compiler Optimization Levels from Binaries by Deep Learning

Cited by: 14

Authors:

Chen, Yu [1,2]

Shi, Zhiqiang [1,2]

Li, Hong [1,2]

Zhao, Weiwei [3]

Liu, Yiliang [4]

Qiao, Yuansong [5]

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China

[3] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou, Peoples R China

[4] Beijing Int Studies Univ, Arab Acad, Beijing, Peoples R China

[5] Athlone Inst Technol, Software Res Inst, Athlone, Ireland

Source:

INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 1 | 2019 / Vol. 868

Funding:

National Natural Science Foundation of China; Science Foundation Ireland;

Keywords:

Binary analysis; Reverse engineering; RNN; Feature embedding; Model explicable;

DOI:

10.1007/978-3-030-01054-6_3

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Compiler optimization levels are important for binary analysis, but they are not available in COTS binaries. In this paper, we present HIMALIA, the first end-to-end system that recovers compiler optimization levels from disassembled binary code without any knowledge of the target instruction set semantics. We achieve this by formulating the problem as a deep learning task and training a two-layer recurrent neural network. Besides the recurrent neural network, HIMALIA is also powered by two other techniques: instruction embedding and a new function representation method. We implement HIMALIA and carry out comprehensive experiments on our dataset consisting of 378,695 different functions from 5,828 binaries compiled by GCC. The results show that HIMALIA achieves an accuracy of around 89%. Moreover, we find that HIMALIA's learnt model is explicable: it can auto-learn common compiler conventions and idioms that match our prior knowledge.
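The abstract does not say how instructions are tokenized or embedded, so the following is only a minimal sketch of the general idea of instruction embedding without instruction set semantics. It assumes each disassembled instruction is treated as an opaque string token; the helper names (`build_vocab`, `encode`, `make_embedding`, `embed`) and the toy functions are hypothetical and are not taken from the paper. The embedding table here is randomly initialized, whereas a real system would learn it during training and feed the resulting vectors to the recurrent network.

```python
import random


def build_vocab(functions):
    """Map each distinct instruction string to an integer id.

    Id 0 is reserved for instructions not seen when the vocabulary was built.
    No knowledge of what an instruction means is used: it is just a token.
    """
    distinct = set()
    for insns in functions:
        distinct.update(insns)
    vocab = {"<unk>": 0}
    for insn in sorted(distinct):  # sorted for a deterministic id assignment
        vocab[insn] = len(vocab)
    return vocab


def encode(insns, vocab):
    """Turn one function (a list of instruction strings) into a list of ids."""
    return [vocab.get(insn, 0) for insn in insns]


def make_embedding(vocab_size, dim, seed=0):
    """Create a randomly initialized embedding table (one row per token id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(ids, table):
    """Look up the embedding vector for every token id in the sequence."""
    return [table[i] for i in ids]


# Hypothetical toy "functions", each a sequence of disassembled instructions.
functions = [
    ["push rbp", "mov rbp, rsp", "pop rbp", "ret"],
    ["xor eax, eax", "ret"],
]

vocab = build_vocab(functions)
ids = encode(["push rbp", "ret", "nop"], vocab)  # "nop" is unseen, so id 0
table = make_embedding(len(vocab), dim=4)
vectors = embed(ids, table)  # one 4-dimensional vector per instruction
```

The sequence of vectors is the kind of input a recurrent classifier could consume one instruction at a time; the choice of whole-instruction tokens rather than separate opcode and operand tokens is an assumption of this sketch.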

Pages: 35-47

Page count: 13

