Feature Extraction Methods for Binary Code Similarity Detection Using Neural Machine Translation Models

Cited by: 1
Authors
Ito, Norimitsu [1, 2]
Hashimoto, Masaki [2 ]
Otsuka, Akira [2 ]
Affiliations
[1] Natl Police Acad, Police Info Commun Res Ctr, Fuchu, Tokyo 1838558, Japan
[2] Inst Informat Secur, Yokohama, Kanagawa 2210835, Japan
Source
IEEE ACCESS | 2023 / Volume 11
Keywords
Feature extraction; Binary codes; Semantics; Software engineering; Computer architecture; Training; Source coding; Machine learning; Neural networks; Machine translation; Binary code similarity detection; Neural machine translation
DOI
10.1109/ACCESS.2023.3316215
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Binary code similarity detection is an effective analysis technique for detecting vulnerabilities, bugs, and plagiarism in software whose source code cannot be obtained. The recent proliferation of IoT devices has also increased the demand for similarity detection across different architectures. However, few feature extraction methods based on neural machine translation (NMT) models have so far been applied to similarity detection at the basic-block level across different architectures. In this research, we propose new methods that extract features faster and detect similarities across different architectures with higher accuracy than existing methods for basic block feature extraction using NMT models. We assume that the intermediate representation of an NMT model that has learned to translate basic blocks across different architectures captures the semantics of the instructions in each basic block. Hence, we adopt this intermediate representation as the features of the basic blocks. We then apply the linear transformation used in bilingual word embedding to align the embedding spaces of basic blocks across different architectures. This enables similarity detection at the basic-block level across different architectures with higher accuracy than the distance learning method that existing research uses to align the embedding spaces. In the evaluation experiment, we compare Precision at k (P@k) on the same dataset as existing methods, and our method achieves the highest accuracy, 92%. We also compare the time required for feature extraction using GPUs and find that our method is up to 16 times faster.
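The alignment and evaluation steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say which linear transformation is used, so this example assumes a plain least-squares mapping, one common formulation in bilingual word embedding. The function names (`fit_linear_map`, `precision_at_k`, `demo`), the synthetic data, and the "x86"/"ARM" labels are all hypothetical and are not taken from the paper; real inputs would be basic-block features extracted from the NMT model.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Return W minimizing ||src @ W - tgt||_F (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def precision_at_k(queries, candidates, k):
    """Fraction of queries whose true match (same row index) is among
    the k candidates with the highest cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T                              # (n_queries, n_candidates)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best candidates
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())


def demo(seed=0, n=200, dim=16):
    """Synthetic stand-in: paired embeddings related by an unknown linear map."""
    rng = np.random.default_rng(seed)
    emb_x86 = rng.normal(size=(n, dim))         # assumed source-architecture features
    hidden_map = rng.normal(size=(dim, dim))    # unknown relation to be recovered
    emb_arm = emb_x86 @ hidden_map + 0.01 * rng.normal(size=(n, dim))

    half = n // 2
    # Fit the mapping on the first half of the pairs, evaluate on the rest.
    W = fit_linear_map(emb_x86[:half], emb_arm[:half])
    before = precision_at_k(emb_x86[half:], emb_arm[half:], k=1)
    after = precision_at_k(emb_x86[half:] @ W, emb_arm[half:], k=1)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"P@1 without alignment: {before:.2f}")
    print(f"P@1 with alignment:    {after:.2f}")
```

Holding out half of the pairs keeps the P@k estimate from being inflated by the pairs used to fit the mapping. An orthogonality constraint on the mapping is another option sometimes used in bilingual embedding work, but it is not shown here.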
Pages: 102796-102805
Number of pages: 10
Related papers
50 in total
  • [1] Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs
    Zuo, Fei
    Li, Xiaopeng
    Young, Patrick
    Luo, Lannan
    Zeng, Qiang
    Zhang, Zhexin
    26TH ANNUAL NETWORK AND DISTRIBUTED SYSTEM SECURITY SYMPOSIUM (NDSS 2019), 2019,
  • [2] Neural Machine Translation via Binary Code Prediction
    Oda, Yusuke
    Arthur, Philip
    Neubig, Graham
    Yoshino, Koichiro
    Nakamura, Satoshi
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 850 - 860
  • [3] CodeExtract: Enhancing Binary Code Similarity Detection with Code Extraction Techniques
    Jia, Lichen
    Wu, Chenggang
    Zhang, Peihua
    Wang, Zhe
    PROCEEDINGS OF THE 25TH ACM SIGPLAN/SIGBED INTERNATIONAL CONFERENCE ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS, LCTES 2024, 2024, : 143 - 154
  • [4] Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery
    Ahmad, Iftakhar
    Luo, Lannan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14581 - 14592
  • [5] Binary Code Similarity Detection
    Liu, Zian
    2021 36TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING ASE 2021, 2021, : 1056 - 1060
  • [6] Binary Similarity Detection Using Machine Learning
    Shalev, Noam
    Partush, Nimrod
    PLAS'18: PROCEEDINGS OF THE 13TH WORKSHOP ON PROGRAMMING LANGUAGES AND ANALYSIS FOR SECURITY, 2018, : 42 - 47
  • [7] Using Neural Machine Translation Methods for Sign Language Translation
    Angelova, Galina
    Avramidis, Eleftherios
    Moeller, Sebastian
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): STUDENT RESEARCH WORKSHOP, 2022, : 273 - 284
  • [8] Multi-semantic feature fusion attention network for binary code similarity detection
    Bangling Li
    Yuting Zhang
    Huaxi Peng
    Qiguang Fan
    Shen He
    Yan Zhang
    Songquan Shi
    Yang Zhang
    Ailiang Ma
    Scientific Reports, 13
  • [9] Multi-semantic feature fusion attention network for binary code similarity detection
    Li, Bangling
    Zhang, Yuting
    Peng, Huaxi
    Fan, Qiguang
    He, Shen
    Zhang, Yan
    Shi, Songquan
    Zhang, Yang
    Ma, Ailiang
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [10] Paraphrase Detection Using Machine Translation and Textual Similarity Algorithms
    Kravchenko, Dmitry
    ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE, 2018, 789 : 277 - 292