Software provenance tracking at the scale of public source code

Cited by: 0
Authors
Guillaume Rousseau
Roberto Di Cosmo
Stefano Zacchiroli
Affiliations
[1] Université de Paris
[2] Inria and Université de Paris
[3] Université de Paris and Inria
Source
Empirical Software Engineering | 2020 / Volume 25
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
Not available
Abstract
We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930-2959
Number of pages: 29