Data Quality for Software Vulnerability Datasets

Cited by: 39

Authors:

Croft, Roland [1,2]

Babar, M. Ali [1,2]

Kholoosi, M. Mehdi [1,2]

Affiliations:

[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia

[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia

Source:

2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE | 2023

Keywords:

software vulnerability; data quality; machine learning;

DOI:

10.1109/ICSE48619.2023.00022

Chinese Library Classification:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
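The duplication figures reported in the abstract suggest a simple check that can be run on any vulnerability dataset before training. The following is a minimal sketch, not the authors' method: it assumes each data point is a source-code string, treats two samples as duplicates when they are identical after whitespace normalization, and uses hypothetical helper names (`normalize`, `fingerprint`, `duplicate_rate`, `leaked_fraction`). The paper may define duplication differently.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting-only differences are ignored."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash of the normalized sample (an assumed notion of 'same data point')."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def duplicate_rate(samples):
    """Fraction of samples whose fingerprint already appeared earlier in the list."""
    seen = set()
    duplicates = 0
    for code in samples:
        fp = fingerprint(code)
        if fp in seen:
            duplicates += 1
        else:
            seen.add(fp)
    return duplicates / len(samples) if samples else 0.0


def leaked_fraction(train, test):
    """Fraction of test samples that also occur in the training split."""
    if not test:
        return 0.0
    train_fps = {fingerprint(code) for code in train}
    return sum(fingerprint(code) in train_fps for code in test) / len(test)
```

A non-zero `leaked_fraction` between training and test splits is one way the benchmark inflation mentioned in the abstract could arise, since a model can score well on test samples it has effectively already seen.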

Pages: 121-133

Page count: 13

