PromeTrans: Bootstrap binary functionality classification with knowledge transferred from pre-trained models

Cited by: 0
Authors
Sha, Zihan [1]
Zhang, Chao [2]
Wang, Hao [2]
Gao, Zeyu [2]
Zhang, Bolun [3]
Lan, Yang [2]
Shu, Hui [2]
Affiliations
[1] Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou 450001, China
[2] Tsinghua University, Beijing, China
[3] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Keywords
Static analysis; Large language model; Program comprehension; Neural networks; Reverse engineering; Third-party library detection
DOI
10.1007/s10664-024-10593-y
Chinese Library Classification
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Pre-trained models have made significant progress in natural language (including source code) and binary code comprehension. However, none of them is suitable for binary functionality classification (BFC). In this paper, we present the first pre-trained model-based solution to BFC, namely PromeTrans, which fuses the knowledge of multiple pre-trained models. Specifically, it overcomes the token-size limitation of pre-trained models with a novel function outlining scheme and uses existing pre-trained assembly language models (AsmLMs) to generate embeddings for binary functions. It then applies a Graph Attention Network (GAT) to aggregate function embeddings along the call graph into a functionality embedding for each function. Finally, it leverages existing pre-trained large language models (LLMs, e.g., GPT-3.5) to classify the functionality of source code functions and aligns these labels to binary functions. Based on the functionality embeddings provided by the AsmLMs and GAT and the functionality-label knowledge provided by LLMs, a simple multi-layer perceptron (MLP) is trained to classify the functionality of binary functions. Our prototype of PromeTrans achieves state-of-the-art (SOTA) performance on various datasets with low overhead. PromeTrans also delivers strong results in real-world applications (e.g., malware analysis). Additionally, by analyzing PromeTrans's training history, we confirm that the knowledge transferred from LLMs is of high quality. This suggests that transferring knowledge from pre-trained models has strong potential to bootstrap binary program comprehension tasks beyond BFC.
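The architecture the abstract describes (per-function AsmLM embeddings aggregated along the call graph by a GAT, then an MLP classifier trained on LLM-derived functionality labels) can be sketched as below. This is a minimal, hypothetical PyTorch illustration, not the paper's implementation; the class names, the single-head attention, and all dimensions are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CallGraphGAT(nn.Module):
    """Single-head graph-attention aggregation over a binary's call graph
    (illustrative stand-in for the paper's GAT component)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) per-function AsmLM embeddings.
        # edges: (E, 2) long tensor of (caller, callee) index pairs.
        h = self.proj(x)
        src, dst = edges[:, 0], edges[:, 1]
        # GAT-style attention logit for every call edge.
        e = F.leaky_relu(self.attn(torch.cat([h[src], h[dst]], dim=-1))).squeeze(-1)
        # Softmax over the incoming edges of each callee, via scatter-add.
        a = torch.exp(e - e.max())
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, a)
        agg = torch.zeros_like(h).index_add_(
            0, dst, (a / denom[dst]).unsqueeze(-1) * h[src])
        # Residual so functions with no callers keep their own embedding.
        return F.elu(agg + h)

class PromeTransHead(nn.Module):
    """GAT aggregation followed by a simple MLP functionality classifier."""
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.gat = CallGraphGAT(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, asmlm_emb: torch.Tensor, call_edges: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.gat(asmlm_emb, call_edges))  # (N, n_classes) logits

Training would then minimize cross-entropy between these logits and the LLM-derived labels on the functions whose source counterparts the LLM classified, e.g. loss = F.cross_entropy(model(emb, edges)[labeled_idx], labels), where labeled_idx is a hypothetical index set of label-aligned binary functions.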
Pages: 34