Data Quality for Software Vulnerability Datasets

Cited by: 40
Authors
Croft, Roland [1 ,2 ]
Babar, M. Ali [1 ,2 ]
Kholoosi, M. Mehdi [1 ,2 ]
Affiliations
[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia
[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia
Source
2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE | 2023
Keywords
software vulnerability; data quality; machine learning;
DOI
10.1109/ICSE48619.2023.00022
Chinese Library Classification
TP31 [Computer software];
Discipline Code
081202; 0835;
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
Pages: 121 - 133
Page count: 13
Related Papers
73 records in total
  • [1] The Adverse Effects of Code Duplication in Machine Learning Models of Code
    Allamanis, Miltiadis
    [J]. PROCEEDINGS OF THE 2019 ACM SIGPLAN INTERNATIONAL SYMPOSIUM ON NEW IDEAS, NEW PARADIGMS, AND REFLECTIONS ON PROGRAMMING AND SOFTWARE (ONWARD! '19), 2019, : 143 - 153
  • [2] Cleaning the NVD: Comprehensive Quality Assessment, Improvements, and Analyses
    Anwar, Afsah
    Abusnaina, Ahmed
    Chen, Songqing
    Li, Frank
    Mohaisen, David
    [J]. 51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS - SUPPLEMENTAL VOL (DSN 2021), 2021, : 1 - 2
  • [3] Arp D, 2022, PROCEEDINGS OF THE 31ST USENIX SECURITY SYMPOSIUM, P3971
  • [4] Juliet 1.1 C/C++ and Java Test Suite
    Boland, Tim
    Black, Paul E.
    [J]. COMPUTER, 2012, 45 (10) : 88 - 90
  • [5] Bosu MF, 2013, IEEE AUS SOFT ENGR, P97, DOI 10.1109/ASWEC.2013.21
  • [6] Braun V., 2006, QUAL RES PSYCHOL, V3, P77, DOI 10.1191/1478088706qp063oa
  • [7] Deep Learning Based Vulnerability Detection: Are We There Yet?
    Chakraborty, Saikat
    Krishna, Rahul
    Ding, Yangruibo
    Ray, Baishakhi
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (09) : 3280 - 3296
  • [8] Cochran W.G., 2007, SAMPLING TECHNIQUES
  • [9] A COEFFICIENT OF AGREEMENT FOR NOMINAL SCALES
    COHEN, J
    [J]. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1960, 20 (01) : 37 - 46
  • [10] Croft R., 2021, P 15 ACM IEEE INT S, DOI DOI 10.1145/3475716.3475781