CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

Cited by: 101
Authors
Bhandari, Guru [1]
Naseer, Amara [1]
Moonen, Leon [1]
Affiliations
[1] Simula Res Lab, Oslo, Norway
Source
PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PREDICTIVE MODELS AND DATA ANALYTICS IN SOFTWARE ENGINEERING (PROMISE '21) | 2021
Keywords
Security vulnerabilities; dataset; software repository mining; vulnerability prediction; vulnerability classification; source code repair;
DOI
10.1145/3475960.3475985
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
Pages: 30-39
Page count: 10
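
The abstract describes organizing CVE records and their fixing commits in a relational database, and its counts (5365 CVE records, 5495 fixing commits) imply that one CVE can map to more than one commit. The following is only a minimal sketch of that idea using Python's built-in `sqlite3` module. The table names (`cve`, `fix`), column names, the placeholder CVE identifier, and the example repository URL are all hypothetical illustrations and are not taken from the CVEfixes release.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical schema for illustration only, not the CVEfixes schema.
        CREATE TABLE cve (
            cve_id      TEXT PRIMARY KEY,
            description TEXT
        );
        CREATE TABLE fix (
            cve_id      TEXT REFERENCES cve(cve_id),
            repo_url    TEXT,
            commit_hash TEXT,
            PRIMARY KEY (cve_id, repo_url, commit_hash)
        );
        """
    )
    # Placeholder rows: one made-up CVE record linked to two made-up commits.
    conn.execute(
        "INSERT INTO cve VALUES (?, ?)",
        ("CVE-0000-0001", "placeholder record"),
    )
    conn.executemany(
        "INSERT INTO fix VALUES (?, ?, ?)",
        [
            ("CVE-0000-0001", "https://example.org/demo/repo", "abc123"),
            ("CVE-0000-0001", "https://example.org/demo/repo", "def456"),
        ],
    )
    conn.commit()
    return conn


def fixes_per_cve(conn):
    """Return a dict mapping each CVE id to its number of fixing commits."""
    rows = conn.execute(
        """
        SELECT c.cve_id, COUNT(f.commit_hash)
        FROM cve AS c
        LEFT JOIN fix AS f ON f.cve_id = c.cve_id
        GROUP BY c.cve_id
        ORDER BY c.cve_id
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(fixes_per_cve(build_demo_db()))
```

Keeping fixes in a separate table keyed by CVE identifier lets a single record link to several commits without duplicating the CVE description, which is the usual reason for choosing a relational layout for this kind of data.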