Boa: Ultra-Large-Scale Software Repository and Source-Code Mining

Cited by: 60
Authors
Dyer, Robert [1]
Hoan Anh Nguyen [2]
Rajan, Hridesh [2]
Nguyen, Tien N. [2]
Affiliations
[1] Bowling Green State Univ, Bowling Green, OH 43403 USA
[2] Iowa State Univ, Ames, IA 50011 USA
Funding
U.S. National Science Foundation
Keywords
Boa; mining software repositories; domain-specific language; scalable; ease of use; lower barrier to entry
DOI
10.1145/2803171
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
In today's software-centric world, ultra-large-scale software repositories, such as SourceForge, GitHub, and Google Code, are the new library of Alexandria. They contain an enormous corpus of software and related information. Scientists and engineers alike are interested in analyzing this wealth of information. However, systematic extraction and analysis of relevant data from these repositories for testing hypotheses is hard, and best left for mining software repository (MSR) experts! Specifically, mining source code yields significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse grained, or sacrifice studying the history of the code. In this article we address mining source code: (a) at a very large scale; (b) at a fine-grained level of detail; and (c) with full history information. To address these challenges, we present domain-specific language features for source-code mining in our language and infrastructure called Boa. The goal of Boa is to ease testing MSR-related hypotheses. Our evaluation demonstrates that Boa substantially reduces programming efforts, thus lowering the barrier to entry. We also show drastic improvements in scalability.
Pages: 34