Data Quality for Software Vulnerability Datasets

Cited by: 40
Authors
Croft, Roland [1, 2]
Babar, M. Ali [1, 2]
Kholoosi, M. Mehdi [1, 2]
Affiliations
[1] Univ Adelaide, CREST, Sch Comp Sci, Adelaide, SA, Australia
[2] Cyber Secur Cooperat Res Ctr, Joondalup, Australia
Source
2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE | 2023
Keywords
software vulnerability; data quality; machine learning
DOI
10.1109/ICSE48619.2023.00022
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
The use of learning-based techniques to achieve automated software vulnerability detection has been of long-standing interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
Pages: 121-133
Number of pages: 13