PALMTREE: Learning an Assembly Language Model for Instruction Embedding

Cited by: 84
Authors
Li, Xuezixiang [1 ]
Qu, Yu [1 ]
Yin, Heng [1 ]
Affiliations
[1] Univ Calif Riverside, Riverside, CA 92521 USA
Source
CCS '21: PROCEEDINGS OF THE 2021 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY | 2021
Funding
U.S. National Science Foundation
Keywords
Deep Learning; Binary Analysis; Language Model; Representation Learning;
DOI
10.1145/3460120.3484587
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
Deep learning has demonstrated its strengths in numerous binary analysis tasks, including function boundary detection, binary code search, function prototype inference, and value set analysis. When applying deep learning to binary analysis tasks, we need to decide what input should be fed into the neural network model. More specifically, we need to answer how to represent an instruction as a fixed-length vector. The idea of automatically learning instruction representations is intriguing, but the existing schemes fail to capture the unique characteristics of disassembly. These schemes ignore the complex intra-instruction structures and rely mainly on control flow, in which the contextual information is noisy and can be influenced by compiler optimizations. In this paper, we propose to pre-train an assembly language model called PALMTREE for generating general-purpose instruction embeddings by conducting self-supervised training on large-scale unlabeled binary corpora. PALMTREE utilizes three pre-training tasks to capture various characteristics of assembly language. These training tasks overcome the problems in existing schemes, and thus help generate high-quality representations. We conduct both intrinsic and extrinsic evaluations, and compare PALMTREE with other instruction embedding schemes. PALMTREE achieves the best performance on intrinsic metrics, and outperforms the other instruction embedding schemes on all downstream tasks.
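To make the core idea concrete, here is a minimal toy sketch of what "representing an instruction as a fixed-length vector" means: each assembly token is mapped to a learned vector, and the token vectors are mean-pooled into one instruction embedding. This is only an illustration of the input/output shape of such a scheme; PALMTREE itself uses a BERT-style Transformer trained with three self-supervised tasks, and the vocabulary, dimension, and pooling here are hypothetical.

```python
# Toy sketch (NOT PALMTREE): mean-pooled token vectors as a
# fixed-length instruction embedding. All names and sizes below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # hypothetical embedding width

# Hypothetical vocabulary of assembly tokens (opcodes, registers,
# brackets, and a placeholder for normalized constants).
vocab = {tok: i for i, tok in enumerate(
    ["mov", "add", "push", "eax", "ebx", "esp", "[", "]", "CONST"])}

# Randomly initialized token vectors stand in for learned embeddings.
token_vectors = rng.normal(size=(len(vocab), EMBED_DIM))

def embed_instruction(tokens):
    """Return one fixed-length vector for a tokenized instruction."""
    ids = [vocab[t] for t in tokens]
    return token_vectors[ids].mean(axis=0)

# Instructions of different lengths map to vectors of the same shape.
v1 = embed_instruction(["mov", "eax", "ebx"])
v2 = embed_instruction(["push", "eax"])
print(v1.shape, v2.shape)  # both (8,)
```

A pooling scheme like this ignores token order and intra-instruction structure (opcode vs. operands), which is exactly the kind of limitation the paper's Transformer-based pre-training is designed to overcome.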
Pages: 3236-3251
Page count: 16