Software provenance tracking at the scale of public source code

Authors
Guillaume Rousseau
Roberto Di Cosmo
Stefano Zacchiroli
Affiliations
[1] Université de Paris
[2] Inria and Université de Paris
[3] Université de Paris and Inria
Source
Empirical Software Engineering | 2020 / Vol. 25
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
Not available
Abstract
We study the possibilities to track provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental, and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e., never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930-2959
Number of pages: 29