The impact of IR-based classifier configuration on the performance and the effort of method-level bug localization

Cited by: 23
Authors
Tantithamthavorn, Chakkrit
Abebe, Surafel Lemma
Hassan, Ahmed E.
Ihara, Akinori
Matsumoto, Kenichi
Affiliations
[1] The University of Adelaide, Australia
[2] The Addis Ababa University, Ethiopia
[3] Queen's University, Canada
[4] Nara Institute of Science and Technology, Japan
Keywords
Bug localization; Classifier configuration; Evaluation metrics; Top-k performance; Effort; PROBABILISTIC RANKING; SOURCE CODE; RETRIEVAL
DOI
10.1016/j.infsof.2018.06.001
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: IR-based bug localization is a classifier that assists developers in locating buggy source code entities (e.g., files and methods) based on the content of a bug report. Such IR-based classifiers have various parameters that can be configured differently (e.g., the choice of entity representation). Objective: In this paper, we investigate the impact of the choice of the IR-based classifier configuration on the top-k performance and the required effort to examine source code entities before locating a bug at the method level. Method: We execute a large space of classifier configurations, 3172 in total, on 5266 bug reports of two software systems, i.e., Eclipse and Mozilla. Results: We find that (1) the choice of classifier configuration impacts the top-k performance from 0.44% to 36% and the required effort from 4395 to 50,000 LOC; (2) classifier configurations with similar top-k performance might require different efforts; (3) VSM achieves both the best top-k performance and the least required effort for method-level bug localization; (4) the likelihood of randomly picking a configuration that performs within 20% of the best top-k classifier configuration is on average 5.4% and that of the least effort is on average 1%; (5) configurations related to the entity representation of the analyzed data have the most impact on both the top-k performance and the required effort; and (6) the most efficient classifier configuration obtained at the method level can also be used at the file level (and vice versa). Conclusion: Our results lead us to conclude that configuration has a large impact on both the top-k performance and the required effort for method-level bug localization, suggesting that the IR-based configuration settings should be carefully selected and the required effort metric should be included in future bug localization studies.
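The abstract names two evaluation measures without spelling out how they are computed. The sketch below shows one plausible reading, not the authors' implementation. It assumes top-k performance is the fraction of bug reports whose ranked list has at least one buggy method in the first k positions, and that effort is the cumulative LOC a developer reads down the ranked list up to and including the first buggy method. The `Method` class, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Method:
    """A ranked source code entity (hypothetical structure for illustration)."""
    name: str
    loc: int      # lines of code in the method
    buggy: bool   # True if the method was changed to fix the bug


def top_k_hit(ranking: List[Method], k: int) -> bool:
    """True if at least one buggy method appears in the first k positions."""
    return any(m.buggy for m in ranking[:k])


def top_k_performance(rankings: List[List[Method]], k: int) -> float:
    """Fraction of bug reports with a hit in the top k (assumed definition)."""
    hits = sum(top_k_hit(r, k) for r in rankings)
    return hits / len(rankings)


def effort_loc(ranking: List[Method]) -> Optional[int]:
    """Cumulative LOC read until the first buggy method, inclusive.

    Returns None when the ranking contains no buggy method.
    """
    total = 0
    for m in ranking:
        total += m.loc
        if m.buggy:
            return total
    return None


# Made-up rankings for two bug reports, ordered best match first.
rankings = [
    [Method("a", 10, False), Method("b", 20, True), Method("c", 5, False)],
    [Method("d", 30, False), Method("e", 40, False), Method("f", 15, True)],
]

print(top_k_performance(rankings, k=2))      # 0.5
print([effort_loc(r) for r in rankings])     # [30, 85]
```

In this toy data, both rankings place the buggy method within the top 3, yet the second requires almost three times as much code to be read. That mirrors finding (2): similar top-k performance can hide different efforts.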
Pages: 160-174
Page count: 15