Software Vulnerabilities Detection Based on a Pre-trained Language Model

Citations: 0
Authors
Xu, Wenlin [1]
Li, Tong [2]
Wang, Jinsong [3]
Duan, Haibo [3]
Tang, Yahui [4]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection
DOI
10.1109/TrustCom60117.2023.00129
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software vulnerability detection is crucial to cyber security because it protects software systems from malicious attacks. Most earlier techniques relied on security professionals to define software features, and then trained a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, we propose an effective unsupervised method that extracts software features and detects software vulnerabilities automatically. First, we obtain software features by building a new pre-trained BERT model: we construct a C/C++ vocabulary and pre-train on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder and create a low-dimensional embedding from the software features. Finally, we apply a clustering-based outlier detection method to the embedding to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
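The final step of the abstract, clustering-based outlier detection on a low-dimensional embedding, can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the scoring rule, so everything below is an assumption made for illustration: k-means is used as the clustering step, the outlier score is the distance to the nearest centroid, and the function names (kmeans, outlier_scores, flag_outliers) and the contamination parameter are hypothetical. The input points are toy 2-D tuples standing in for embeddings; the BERT pre-training and autoencoder stages are not reproduced here.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means on tuples of floats.

    Assumption: centroids are initialised from the first k points,
    which keeps the sketch deterministic (not a recommended init).
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids


def outlier_scores(points, centroids):
    """Score each point by its distance to the nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


def flag_outliers(points, k=2, contamination=0.1):
    """Return indices of the highest-scoring points, plus all scores.

    contamination is the assumed fraction of points to flag.
    """
    centroids = kmeans(points, k)
    scores = outlier_scores(points, centroids)
    n_flag = max(1, round(contamination * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag]), scores


if __name__ == "__main__":
    # Two tight groups plus one point far from both (index 8).
    embeddings = [
        (0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (10.1, 10.0),
        (0.0, 0.1), (10.0, 10.1), (0.1, 0.1), (10.1, 10.1),
        (5.0, 30.0),
    ]
    flagged, scores = flag_outliers(embeddings, k=2, contamination=0.1)
    print("flagged indices:", flagged)
```

In this toy setting the point far from both groups receives the largest score and is the one flagged. A distance-to-centroid score is only one of several possible choices; the paper may use a different clustering method or scoring rule.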
Pages: 904-911
Page count: 8