CDNM: Clustering-Based Data Normalization Method For Automated Vulnerability Detection

Cited by: 1

Authors:

Wu, Tongshuai [1,2]

Chen, Liwei [1,2]

Du, Gewangzi [1,2]

Zhu, Chenguang [1,2]

Cui, Ningning [1,2]

Shi, Gang [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China

Source:

COMPUTER JOURNAL | 2024 / Vol. 67 / Issue 04

Funding:

National Natural Science Foundation of China;

Keywords:

Data Normalization; Clustering; Vulnerability Detection; Deep Learning;

DOI:

10.1093/comjnl/bxad080

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The key steps in a deep learning vulnerability detection framework are pre-processing the source code and learning vulnerability features. Traditional source code representation techniques fully normalize user-defined symbols and, in doing so, discard the semantic information associated with vulnerabilities. The current mainstream model for vulnerability feature learning is the Recurrent Neural Network (RNN), whose sequential structure limits its ability to capture long-range information. This paper proposes a new vulnerability detection framework to address both problems. In the source code pre-processing phase, we propose a new data normalization method: user-defined symbols are clustered with the unsupervised K-means algorithm, and each symbol is normalized according to its cluster. This preserves the primary semantic information in the source code while keeping the sample data smooth. In the feature extraction stage, we convert the normalized source code to a text representation and feed it into Bidirectional Encoder Representations from Transformers (BERT) for automatic feature learning, which improves both semantic information extraction and long-range information capture. Experimental results on real-world data that we collected show that the vulnerability detection precision of this method is 18.3% higher than that of the current mainstream vulnerability detection framework. Our method also improves on the precision of the state-of-the-art method by 4.2%.
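The clustering-based normalization described in the abstract can be illustrated with a minimal sketch. The abstract does not say how symbols are turned into feature vectors, so everything below is an assumption made for illustration: the character-bigram features, the `SYM_C<k>` token format, the small keyword and library-call lists, and the helper names `user_symbols`, `featurize`, `kmeans`, and `normalize` are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical allow-lists: tokens kept verbatim instead of being normalized.
C_KEYWORDS = {"int", "char", "if", "else", "for", "while", "return", "void", "sizeof"}
KNOWN_CALLS = {"strcpy", "memcpy", "malloc", "free", "strlen", "printf"}
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")

def user_symbols(code):
    """Return user-defined identifiers in order of first appearance."""
    seen = []
    for tok in IDENT.findall(code):
        if tok not in C_KEYWORDS and tok not in KNOWN_CALLS and tok not in seen:
            seen.append(tok)
    return seen

def featurize(name, vocab):
    """Character-bigram count vector (an assumed, illustrative feature choice)."""
    counts = Counter(name[i:i + 2] for i in range(len(name) - 1))
    return [float(counts[g]) for g in vocab]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def normalize(code, k=2):
    """Replace each user-defined symbol with a token naming its cluster."""
    symbols = user_symbols(code)
    vocab = sorted({s[i:i + 2] for s in symbols for i in range(len(s) - 1)})
    points = [featurize(s, vocab) for s in symbols]
    labels = kmeans(points, min(k, len(symbols)))
    mapping = {s: f"SYM_C{lab}" for s, lab in zip(symbols, labels)}
    normalized = IDENT.sub(lambda m: mapping.get(m.group(0), m.group(0)), code)
    return normalized, mapping

snippet = "void copy_buf(char *dst_buf, char *src_buf) { strcpy(dst_buf, src_buf); }"
normalized, mapping = normalize(snippet, k=2)
print(normalized)
print(mapping)
```

In this sketch, symbols with similar spellings share a cluster token, while keywords and known library calls such as `strcpy` are left untouched. The regular expression does not skip string literals or comments, so a real pre-processing pipeline would need a proper tokenizer.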

Citation

Pages: 1538-1549

Page count: 12
