Handling Language Variations in Open Source Bug Reporting Systems

Cited by: 1
Authors
Banerjee, Sean [1]
Musgrove, Jesse [1]
Cukic, Bojan [1]
Affiliations
[1] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
Source
23RD IEEE INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING WORKSHOPS (ISSRE 2012) | 2012
Keywords
Typographical Errors; Alternate Spellings; Duplicate Bug Reports; String Algorithms; Software Maintenance; Software Reliability;
DOI
10.1109/ISSREW.2012.85
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free-form English. However, the free-form aspect makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain-independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository until March 2012. Our results indicate measurable improvements in duplicate detection results, while reducing the language space for the most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, thus making it suitable for use in a variety of systems.
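As an illustration only, the idea of combining a word distance with a neighbor-word measure can be sketched as follows. This is not the authors' algorithm: the abstract does not specify the measures used, so the sketch assumes Levenshtein edit distance for the word distance and a Jaccard overlap of adjacent-word sets as a stand-in for neighbor word likelihood. The function names, the thresholds (max_distance, min_overlap), and the sample reports are all hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def neighbor_counts(reports):
    """Map each word to a Counter of the words seen immediately next to it."""
    neighbors = defaultdict(Counter)
    for tokens in reports:
        for left, right in zip(tokens, tokens[1:]):
            neighbors[left][right] += 1
            neighbors[right][left] += 1
    return neighbors


def neighbor_overlap(w1, w2, neighbors):
    """Jaccard overlap of two words' neighbor sets (assumed stand-in score)."""
    n1, n2 = set(neighbors[w1]), set(neighbors[w2])
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def suggest_merges(reports, max_distance=1, min_overlap=0.5):
    """Propose rare-word -> frequent-word merges; thresholds are hypothetical."""
    freq = Counter(tok for tokens in reports for tok in tokens)
    neighbors = neighbor_counts(reports)
    vocab = sorted(freq, key=lambda w: (-freq[w], w))  # most frequent first
    merges = {}
    for i, common in enumerate(vocab):
        if common in merges:
            continue  # a word already merged away is not a merge target
        for rare in vocab[i + 1:]:
            if rare in merges or freq[rare] >= freq[common]:
                continue  # skip handled words and equal-frequency ties
            if (edit_distance(common, rare) <= max_distance
                    and neighbor_overlap(common, rare, neighbors) >= min_overlap):
                merges[rare] = common
    return merges


# Hypothetical tokenized bug-report summaries.
reports = [
    ["browser", "crashes", "on", "startup"],
    ["browser", "crashs", "on", "startup"],
    ["browser", "crashes", "on", "exit"],
    ["page", "crashes", "on", "load"],
]
print(suggest_merges(reports))
```

Because the candidate spellings come from the corpus itself, a sketch like this needs no pre-built dictionary, which matches the dictionary-free property the abstract claims. The pairwise comparison is quadratic in vocabulary size, so a real repository would need some form of candidate pruning.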
Pages: 325-330
Page count: 6