VALIDATE: A deep dive into vulnerability prediction datasets

Cited by: 3
Authors
Esposito, Matteo [1, 2]
Falessi, Davide [1 ]
Affiliations
[1] Univ Roma Tor Vergata, Via Politecn 1, I-00132 Rome, Lazio, Italy
[2] Multitel Srl, Via Modigliani 27, I-70014 Puglia, Italy
Keywords
Security; Replicability; Vulnerability; Machine learning; Repository; Dataset; SOFTWARE; REPRODUCIBILITY; REPOSITORIES; METRICS; IMPACT
DOI
10.1016/j.infsof.2024.107448
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Context: Vulnerabilities are a critical issue today, as they cause economic damage to the industry and endanger our daily life by threatening critical national security infrastructures. Vulnerability prediction supports software engineers in preventing malicious attackers from exploiting vulnerabilities, thus improving the security and reliability of software. Datasets are vital to vulnerability prediction studies, because machine learning models cannot be trained or evaluated without them. Dataset creation is time-consuming, error-prone, and difficult to validate. Objectives: This study aims to characterise the datasets of prediction studies in terms of availability and features. Moreover, to support researchers in finding and sharing datasets, we provide the first VulnerAbiLity predIction DatAseT rEpository (VALIDATE). Methods: We perform a systematic literature review of the datasets of vulnerability prediction studies. Results: Our results show that out of 50 primary studies, only 22 studies (i.e., 38%) provide a reachable dataset. Of these 22 studies, only one study provides a dataset in a stable repository. Conclusions: Our repository of 31 datasets, 22 reachable plus nine datasets provided by authors via email, supports researchers in finding datasets of interest, hence avoiding reinventing the wheel; this translates into less effort, more reliability, and more reproducibility in dataset creation and use.
Pages: 18