Software provenance tracking at the scale of public source code

Cited by: 17
Authors
Rousseau, Guillaume [1]
Di Cosmo, Roberto [1,2]
Zacchiroli, Stefano [1,2]
Affiliations
[1] Univ Paris, Paris, France
[2] INRIA, Paris, France
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
10.1007/s10664-020-09828-5
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, which holds over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental, and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e., never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930-2959
Number of pages: 30