Source code analysis dataset

Cited by: 6
Authors
Gelman, Ben [1]
Obayomi, Banjo [1]
Moore, Jessica [1]
Slater, David [1]
Affiliations
[1] Two Six Labs, Machine Learning Grp, 901 N Stuart St, Suite 1000, Arlington, VA 22203 USA
Keywords
Source code; Code comments; Bug detection; Static analysis
DOI
10.1016/j.dib.2019.104712
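Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract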
The data in this article pair source code with three kinds of artifacts, drawn from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw C and C++ source code repositories with potential code vulnerabilities, which are determined by running the Infer static analyzer. The code and comment pairs can be used for tasks such as predicting comments or creating natural language descriptions of code. The code and build artifact pairs can be used for tasks such as reverse engineering or improving intermediate representations of code from decompiled binaries. The code and static analyzer pairs can be used for tasks such as machine learning approaches to vulnerability discovery. © 2019 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
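The selection rule and the three pairings described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the `Repo` record, its field names, the `REDISTRIBUTABLE` license set, and the assumption that each repository has a single primary language are all hypothetical, since this record does not list the licenses or metadata schema the article used.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of license identifiers treated as redistributable.
# The article's actual license list is not given in this record.
REDISTRIBUTABLE = {"mit", "apache-2.0", "bsd-3-clause"}

# "at least 10 stars", as stated in the abstract.
MIN_STARS = 10

# Languages named in the abstract for each pairing.
COMMENT_LANGUAGES = {"C", "C++", "Java", "Python"}  # Doxygen comment pairs
BUILD_LANGUAGES = {"C", "C++"}  # make build artifacts and Infer reports


@dataclass(frozen=True)
class Repo:
    """Hypothetical repository metadata record (not the dataset's schema)."""
    name: str
    stars: int
    license_key: Optional[str]
    language: str  # assumed single primary language


def is_eligible(repo: Repo) -> bool:
    """Apply the abstract's filter: redistributable license and >= 10 stars."""
    if repo.license_key is None:
        return False
    return repo.stars >= MIN_STARS and repo.license_key.lower() in REDISTRIBUTABLE


def artifact_kinds(repo: Repo) -> List[str]:
    """List which of the three pairings would apply to an eligible repo."""
    if not is_eligible(repo):
        return []
    kinds = []
    if repo.language in COMMENT_LANGUAGES:
        kinds.append("comments")
    if repo.language in BUILD_LANGUAGES:
        kinds.extend(["build", "static_analysis"])
    return kinds
```

Under these assumptions, a C repository with 10 stars and an MIT license would qualify for all three pairings, while a Python repository would qualify only for the code and comment pairs.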
Pages: 6