Sourcerer: An infrastructure for large-scale collection and analysis of open-source code

Cited by: 55
Authors
Bajracharya, Sushil [1]
Ossher, Joel [1]
Lopes, Cristina [1]
Affiliations
[1] Univ Calif Irvine, Irvine, CA 92697 USA
Keywords
Open source; Internet-scale code retrieval; Data mining; Sourcerer; Static analysis; Software information retrieval; SOFTWARE; SEARCH; REUSE;
DOI
10.1016/j.scico.2012.04.008
Chinese Library Classification
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
A large amount of open source code is now available online, presenting a great potential resource for software developers. This has motivated software engineering researchers to develop tools and techniques to allow developers to reap the benefits of these billions of lines of source code. However, collecting and analyzing such a large quantity of source code presents a number of challenges. Although the current generation of open source code search engines provides access to the source code in an aggregated repository, they generally fail to take advantage of the rich structural information contained in the code they index. This makes them significantly less useful than Sourcerer for building state-of-the-art software engineering tools, as these tools often require access to both the structural and textual information available in source code. We have developed Sourcerer, an infrastructure for large-scale collection and analysis of open source code. By taking full advantage of the structural information extracted from source code in its repository, Sourcerer provides a foundation upon which state-of-the-art search engines and related tools can easily be built. We describe the Sourcerer infrastructure, present the applications that we have built on top of it, and discuss how existing tools could benefit from using Sourcerer. © 2012 Elsevier B.V. All rights reserved.
Pages: 241-259
Page count: 19