Data Quality for Software Vulnerability Datasets

Cited by: 39
Authors
Croft, Roland [1, 2]
Babar, M. Ali [1, 2]
Kholoosi, M. Mehdi [1, 2]
Affiliations
[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia
[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia
Keywords
software vulnerability; data quality; machine learning
DOI
10.1109/ICSE48619.2023.00022
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
Pages: 121-133
Number of pages: 13