Just enough semantics: An information theoretic approach for IR-based software bug localization

Cited by: 33
Authors
Khatiwada, Saket [1]
Tushev, Miroslav [1]
Mahmoud, Anas [1]
Affiliations
[1] Louisiana State Univ, Div Comp Sci & Engn, Baton Rouge, LA 70803 USA
Keywords
Information retrieval; Bug localization; Information theory; LATENT DIRICHLET ALLOCATION; SOURCE-CODE; TRACEABILITY LINKS; RETRIEVAL; LOCATION; REPRESENTATIONS; DOCUMENTATION; COMPREHENSION; EVOLUTION; KNOWLEDGE
DOI
10.1016/j.infsof.2017.08.012
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: Software systems are often shipped with defects. Whenever a bug is reported, developers use the information available in the associated report to locate source code fragments that need to be modified in order to fix the bug. However, as software systems evolve in size and complexity, bug localization can become a tedious and time-consuming process. To minimize the manual effort, contemporary bug localization tools utilize Information Retrieval (IR) methods for automated support. IR methods exploit the textual content of bug reports to automatically capture and rank relevant buggy source files.
Objective: In this paper, we propose a new paradigm of information-theoretic IR methods to support bug localization tasks in software systems. These methods, including Pointwise Mutual Information (PMI) and Normalized Google Distance (NGD), exploit the co-occurrence patterns of code terms in the software system to reveal hidden textual semantic dimensions that other methods often fail to capture. Our objective is to establish accurate semantic similarity relations between source code and bug reports.
Method: Five benchmark datasets from different application domains are used to conduct our analysis. The proposed methods are compared against classical IR methods that are commonly used in bug localization research.
Results: The results show that information-theoretic IR methods significantly outperform classical IR methods, providing a semantically aware yet computationally efficient solution for bug localization in large and complex software systems. (A replication package is available at: http://seel.cse.lsu.edu/data/ist17.zip).
Conclusions: Information-theoretic co-occurrence methods provide "just enough semantics" necessary to establish relations between bug reports and code artifacts, achieving a balance between simple lexical methods and computationally expensive semantic IR methods that require substantial amounts of data to function properly.
(c) 2017 Elsevier B.V. All rights reserved.
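To make the two measures named in the abstract concrete, the following is a minimal sketch of PMI and NGD computed from document-level co-occurrence counts. It uses the standard textbook definitions of both measures and is not the authors' implementation: treating each file as the co-occurrence window, the toy corpus, and all function names (`cooccurrence_counts`, `pmi`, `ngd`) are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each term and each unordered term pair, the number of
    documents containing it. Assumption: one document = one co-occurrence window."""
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = set(doc)
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return term_df, pair_df


def _joint(x, y, term_df, pair_df):
    # A term always co-occurs with itself, so reuse its own document frequency.
    return term_df[x] if x == y else pair_df[frozenset((x, y))]


def pmi(x, y, term_df, pair_df, n_docs):
    """Pointwise Mutual Information: log2( P(x, y) / (P(x) * P(y)) )."""
    fxy = _joint(x, y, term_df, pair_df)
    if fxy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2(fxy * n_docs / (term_df[x] * term_df[y]))


def ngd(x, y, term_df, pair_df, n_docs):
    """Normalized Google Distance:
    (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y)))."""
    fxy = _joint(x, y, term_df, pair_df)
    if fxy == 0:
        return float("inf")  # the terms never co-occur
    lx, ly, lxy = math.log(term_df[x]), math.log(term_df[y]), math.log(fxy)
    denom = math.log(n_docs) - min(lx, ly)
    if denom == 0:
        return 0.0  # both terms appear in every document
    return (max(lx, ly) - lxy) / denom


# Hypothetical toy corpus: each inner list stands for the terms of one source file.
docs = [
    ["file", "open", "stream"],
    ["file", "close", "stream"],
    ["button", "click", "event"],
    ["button", "render", "event"],
]
term_df, pair_df = cooccurrence_counts(docs)
print(pmi("file", "stream", term_df, pair_df, len(docs)))
print(ngd("file", "stream", term_df, pair_df, len(docs)))
print(ngd("file", "button", term_df, pair_df, len(docs)))
```

In this sketch, a higher PMI and a lower NGD both indicate terms that tend to appear together. One plausible way to score a bug report against a source file would be to aggregate such pairwise term scores; that aggregation is left out here because the abstract does not specify it.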
Pages: 45-57
Page count: 13