DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Cited by: 42
Authors
Chen, Yizheng [1 ]
Ding, Zhoujie [2 ]
Alowain, Lamya [3 ]
Chen, Xinyun [4 ]
Wagner, David [2 ]
Affiliations
[1] Univ Maryland, Baltimore, MD 21201 USA
[2] Univ Calif Berkeley, Berkeley, CA USA
[3] King Abdulaziz City Sci & Technol, Riyadh, Saudi Arabia
[4] Google Deepmind, London, England
Source
PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON RESEARCH IN ATTACKS, INTRUSIONS AND DEFENSES, RAID 2023 | 2023
Keywords
datasets; vulnerability detection; deep learning; large language models;
DOI
10.1145/3607199.3607242
Chinese Library Classification
TP [Automation technology; Computer technology];
Discipline Code
0812;
Abstract
We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.
Pages: 654-668
Page count: 15