PromeTrans: Bootstrap binary functionality classification with knowledge transferred from pre-trained models

Cited by: 0
Authors
Sha, Zihan [1]
Zhang, Chao [2]
Wang, Hao [2]
Gao, Zeyu [2]
Zhang, Bolun [3]
Lan, Yang [2]
Shu, Hui [2]
Affiliations
[1] Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou 450001, China
[2] Tsinghua University, Beijing, China
[3] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Keywords
Static analysis; Large language model; Program comprehension; Neural networks; Reverse engineering; Third-party library detection
DOI
10.1007/s10664-024-10593-y
Chinese Library Classification
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Pre-trained models have made significant progress in natural language (including source code) and binary code comprehension. However, none of them is suitable for binary functionality classification (BFC). In this paper, we present the first pre-trained model-based solution to BFC, namely PromeTrans, which fuses the knowledge of multiple pre-trained models. Specifically, it overcomes the token-size limitation of pre-trained models with a novel function outlining scheme and uses existing pre-trained assembly language models (AsmLMs) to generate embeddings for binary functions. It then applies a Graph Attention Network (GAT) to aggregate function embeddings along the call graph into a functionality embedding for each function. Finally, it leverages existing pre-trained large language models (LLMs, e.g., GPT-3.5) to classify the functionality of source code functions and aligns these labels to binary functions. Based on the functionality embeddings provided by the AsmLMs and GAT and the functionality-label knowledge provided by LLMs, a simple multi-layer perceptron (MLP) is trained to classify the functionality of binary functions. Our prototype of PromeTrans achieves state-of-the-art (SOTA) performance on various datasets with low overhead. PromeTrans also delivers strong results in real-world applications (e.g., malware analysis). Additionally, by analyzing PromeTrans's training history, we confirm that the knowledge transferred from LLMs is of high quality. This suggests that transferring knowledge from pre-trained models has strong potential to bootstrap binary program comprehension tasks beyond BFC.
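The architecture the abstract describes (per-function AsmLM embeddings aggregated along the call graph by a GAT, then an MLP classifier trained on LLM-derived functionality labels) can be sketched as below. This is a minimal, hypothetical PyTorch illustration, not the paper's implementation; the class names, the single-head attention, and all dimensions are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CallGraphGAT(nn.Module):
    """Single-head graph-attention aggregation over a binary's call graph
    (illustrative stand-in for the paper's GAT component)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) per-function AsmLM embeddings.
        # edges: (E, 2) long tensor of (caller, callee) index pairs.
        h = self.proj(x)
        src, dst = edges[:, 0], edges[:, 1]
        # GAT-style attention logit for every call edge.
        e = F.leaky_relu(self.attn(torch.cat([h[src], h[dst]], dim=-1))).squeeze(-1)
        # Softmax over the incoming edges of each callee, via scatter-add.
        a = torch.exp(e - e.max())
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, a)
        agg = torch.zeros_like(h).index_add_(
            0, dst, (a / denom[dst]).unsqueeze(-1) * h[src])
        # Residual so functions with no callers keep their own embedding.
        return F.elu(agg + h)

class PromeTransHead(nn.Module):
    """GAT aggregation followed by a simple MLP functionality classifier."""
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.gat = CallGraphGAT(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, asmlm_emb: torch.Tensor, call_edges: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.gat(asmlm_emb, call_edges))  # (N, n_classes) logits

Training would then minimize cross-entropy between these logits and the LLM-derived labels on the functions whose source counterparts the LLM classified, e.g. loss = F.cross_entropy(model(emb, edges)[labeled_idx], labels), where labeled_idx is a hypothetical index set of label-aligned binary functions.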
Pages: 34