Data Quality for Software Vulnerability Datasets

Cited by: 40
Authors
Croft, Roland [1 ,2 ]
Babar, M. Ali [1 ,2 ]
Kholoosi, M. Mehdi [1 ,2 ]
Affiliations
[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia
[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia
Source
2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE | 2023
Keywords
software vulnerability; data quality; machine learning;
DOI
10.1109/ICSE48619.2023.00022
Chinese Library Classification
TP31 [Computer software];
Discipline Code
081202; 0835;
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
Pages: 121 - 133
Page count: 13
Related Papers
73 records in total
  • [1] The Adverse Effects of Code Duplication in Machine Learning Models of Code
    Allamanis, Miltiadis
    [J]. PROCEEDINGS OF THE 2019 ACM SIGPLAN INTERNATIONAL SYMPOSIUM ON NEW IDEAS, NEW PARADIGMS, AND REFLECTIONS ON PROGRAMMING AND SOFTWARE (ONWARD! '19), 2019, : 143 - 153
  • [2] Cleaning the NVD: Comprehensive Quality Assessment, Improvements, and Analyses
    Anwar, Afsah
    Abusnaina, Ahmed
    Chen, Songqing
    Li, Frank
    Mohaisen, David
    [J]. 51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS - SUPPLEMENTAL VOL (DSN 2021), 2021, : 1 - 2
  • [3] Arp D, 2022, PROCEEDINGS OF THE 31ST USENIX SECURITY SYMPOSIUM, P3971
  • [4] Juliet 1.1 C/C++ and Java Test Suite
    Boland, Tim
    Black, Paul E.
    [J]. COMPUTER, 2012, 45 (10) : 88 - 90
  • [5] Bosu MF, 2013, IEEE AUS SOFT ENGR, P97, DOI 10.1109/ASWEC.2013.21
  • [6] Braun V., 2006, QUAL RES PSYCHOL, V3, P77, DOI 10.1191/1478088706qp063oa
  • [7] Deep Learning Based Vulnerability Detection: Are We There Yet?
    Chakraborty, Saikat
    Krishna, Rahul
    Ding, Yangruibo
    Ray, Baishakhi
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (09) : 3280 - 3296
  • [8] Cochran W.G., 2007, SAMPLING TECHNIQUES
  • [9] A COEFFICIENT OF AGREEMENT FOR NOMINAL SCALES
    COHEN, J
    [J]. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1960, 20 (01) : 37 - 46
  • [10] Croft R., 2021, P 15 ACM IEEE INT S, DOI DOI 10.1145/3475716.3475781