Sourcerer: An infrastructure for large-scale collection and analysis of open-source code

Cited by: 55
Authors
Bajracharya, Sushil [1]
Ossher, Joel [1]
Lopes, Cristina [1]
Affiliations
[1] Univ Calif Irvine, Irvine, CA 92697 USA
Keywords
Open source; Internet-scale code retrieval; Data mining; Sourcerer; Static analysis; Software information retrieval; SOFTWARE; SEARCH; REUSE
DOI
10.1016/j.scico.2012.04.008
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
A large amount of open source code is now available online, presenting a great potential resource for software developers. This has motivated software engineering researchers to develop tools and techniques to allow developers to reap the benefits of these billions of lines of source code. However, collecting and analyzing such a large quantity of source code presents a number of challenges. Although the current generation of open source code search engines provides access to the source code in an aggregated repository, they generally fail to take advantage of the rich structural information contained in the code they index. This makes them significantly less useful than Sourcerer for building state-of-the-art software engineering tools, as these tools often require access to both the structural and textual information available in source code. We have developed Sourcerer, an infrastructure for large-scale collection and analysis of open source code. By taking full advantage of the structural information extracted from source code in its repository, Sourcerer provides a foundation upon which state-of-the-art search engines and related tools can easily be built. We describe the Sourcerer infrastructure, present the applications that we have built on top of it, and discuss how existing tools could benefit from using Sourcerer. © 2012 Elsevier B.V. All rights reserved.
Pages: 241-259
Page count: 19