Data Quality for Software Vulnerability Datasets

Cited by: 39
Authors
Croft, Roland [1, 2]
Babar, M. Ali [1, 2]
Kholoosi, M. Mehdi [1, 2]
Affiliations
[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia
[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia
Keywords
software vulnerability; data quality; machine learning
DOI
10.1109/ICSE48619.2023.00022
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
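As a rough illustration of the two issues the abstract quantifies (duplicated data points and inconsistent labels), the following minimal sketch shows one way such rates could be measured. It is not the authors' method: the whitespace-only normalization, the function names `normalize`, `duplicate_rate`, and `conflicting_groups`, and the toy samples are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter, defaultdict


def normalize(code: str) -> str:
    # Assumption: two samples count as duplicates if they differ only in whitespace.
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    # Hash the normalized text so large samples compare cheaply.
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat an earlier sample after normalization."""
    if not samples:
        return 0.0
    counts = Counter(fingerprint(s) for s in samples)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(samples)


def conflicting_groups(samples: list[str], labels: list[int]) -> int:
    """Number of duplicate groups whose members carry different labels."""
    seen = defaultdict(set)
    for sample, label in zip(samples, labels):
        seen[fingerprint(sample)].add(label)
    return sum(1 for group in seen.values() if len(group) > 1)


# Toy data (hypothetical): label 1 = vulnerable, 0 = non-vulnerable.
samples = [
    "int f() { return 0; }",
    "int f()  {\n return 0;\n}",
    "void g() {}",
    "void g() {}",
]
labels = [1, 0, 0, 0]

print(duplicate_rate(samples))
print(conflicting_groups(samples, labels))
```

A duplicate that appears in both the training and test splits is one way benchmark performance could be inflated, and a duplicate group with conflicting labels is one observable sign of label inconsistency.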
Pages: 121-133
Number of pages: 13