Software Vulnerabilities Detection Based on a Pre-trained Language Model

Cited by: 0
Authors
Xu, Wenlin [1]
Li, Tong [2]
Wang, Jinsong [3]
Duan, Haibo [3]
Tang, Yahui [4]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection;
DOI
10.1109/TrustCom60117.2023.00129
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software vulnerability detection is crucial in cyber security because it protects software systems from malicious attacks. Most earlier techniques relied on security professionals to define software features, and then trained a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, we propose an unsupervised and effective method that extracts software features and detects software vulnerabilities automatically. First, we obtain software features by building a new pre-trained BERT model: we construct a C/C++ vocabulary and pre-train on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder and create a low-dimensional embedding from the software features. Finally, we apply a clustering-based outlier detection method to the embedding to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
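The last stage of the pipeline described in the abstract, clustering-based outlier detection on low-dimensional embeddings, can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the scoring rule, so everything here is an assumption: plain k-means, a "large cluster" threshold loosely inspired by CBLOF-style scoring, and the hypothetical names `kmeans`, `outlier_scores`, `flag_outliers`, `min_frac`, and `contamination`. The input points stand in for the autoencoder embeddings; this is not the authors' implementation.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, cluster sizes)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    sizes = [0] * k
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            buckets[nearest].append(p)
        sizes = [len(b) for b in buckets]
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, sizes


def outlier_scores(points, centroids, sizes, min_frac=0.2):
    """Score = distance to the nearest centroid of a 'large' cluster.

    Assumption: a cluster is 'large' if it holds at least min_frac of all
    points. Points in tiny clusters are therefore scored against the
    nearest large cluster, so an isolated point cannot hide in its own
    singleton cluster.
    """
    threshold = min_frac * len(points)
    large = [c for c, n in zip(centroids, sizes) if n >= threshold]
    if not large:  # fallback: treat every cluster as large
        large = centroids
    return [min(math.dist(p, c) for c in large) for p in points]


def flag_outliers(points, k=2, contamination=0.1, seed=0):
    """Return sorted indices of the highest-scoring points."""
    centroids, sizes = kmeans(points, k, seed=seed)
    scores = outlier_scores(points, centroids, sizes)
    n_flag = max(1, round(contamination * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag])
```

Calling `flag_outliers(embeddings, k=2, contamination=0.1)` on a list of equal-length numeric tuples returns the indices of the points treated as candidate vulnerabilities. In practice `k` and `contamination` would have to be tuned per dataset; the values shown are placeholders.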
Pages: 904-911
Page count: 8