Improving Performance of Automatic Duplicate Bug Reports Detection using Longest Common Sequence: Introducing New Textual Features for Textual Similarity Detection
2019 IEEE 5TH CONFERENCE ON KNOWLEDGE BASED ENGINEERING AND INNOVATION (KBEI 2019)
2019
Keywords:
Triage System;
Bug Reports;
Duplicate;
Automatic;
Detection;
Text Mining;
Natural Language Processing;
Information Retrieval;
Longest Common Sequence;
DOI:
Not available
Chinese Library Classification (CLC) number:
TP39 [Computer applications];
Discipline classification codes:
081203;
0835;
Abstract:
Automatic duplicate bug report detection has been a well-known problem in mining software repositories since 2004, particularly for software triage systems such as Bugzilla. Textual features are the most important type of feature in similarity and duplicate detection; BM25F, for example, reflects the frequency of terms common to two reports. Sometimes a common sequence can reveal more similarity between two texts, so this paper proposes new features based on the longest common sequence (LCS) of two bug reports as textual features for text similarity detection. The Android, Eclipse, Mozilla, and Open Office datasets are used to evaluate the proposed features. The experimental results show that LCS-based features are important: the accuracy, precision, and recall of the classifier prediction models improve by 4.5, 2.5, and 2.5 percent respectively on average after adding LCS, reaching up to 96, 98, and 97 percent respectively on average across different classifiers.
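As an illustration of the idea of an LCS-based textual feature, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the LCS is computed or normalized, so several choices here are assumptions: the sequence is taken to be a longest common subsequence over lowercase whitespace-separated word tokens, and the helper names `tokenize`, `lcs_length`, and `lcs_features`, as well as the `lcs_ratio` normalization, are hypothetical.

```python
def tokenize(text):
    # Assumption: lowercase, whitespace-separated word tokens.
    return text.lower().split()


def lcs_length(a, b):
    # Classic dynamic programme for the longest common subsequence,
    # keeping only the previous row to save memory.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_features(report_a, report_b):
    # Hypothetical feature pair: the raw LCS length and the length
    # normalized by the shorter report (0.0 if either report is empty).
    a, b = tokenize(report_a), tokenize(report_b)
    length = lcs_length(a, b)
    shorter = min(len(a), len(b))
    ratio = length / shorter if shorter else 0.0
    return {"lcs_length": length, "lcs_ratio": ratio}


features = lcs_features(
    "App crashes on startup",
    "The app always crashes at startup",
)
print(features)
```

Features of this kind could be appended to frequency-based scores such as BM25F before training a classifier on pairs of reports; unlike term-frequency features, they are sensitive to the order in which shared terms appear.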