Automated triaging of very large bug repositories

Cited by: 24
Authors
Banerjee, Sean [1]
Syed, Zahid [2]
Helmick, Jordan [3]
Culp, Mark [4]
Ryan, Kenneth [4]
Cukic, Bojan [5]
Affiliations
[1] Clarkson Univ, Potsdam, NY 13699 USA
[2] Univ Michigan, Flint, MI 48503 USA
[3] MedExpress, Morgantown, WV USA
[4] West Virginia Univ, Morgantown, WV USA
[5] Univ North Carolina Charlotte, Charlotte, NC USA
Keywords
Automated triaging; Bug tracking; Big data analytics; Software problem repositories
DOI
10.1016/j.infsof.2016.09.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: Bug tracking systems play an important role in software maintenance. They allow both developers and users to submit problem reports on observed failures. However, because anyone can submit a problem report, more than one reporter is likely to report the same issue. Research in open source repositories has focused on two broad areas: determining the original report associated with each known duplicate, and assigning a developer to fix a particular problem.
Objective: Limited research has been done in developing a fully automated triager, one that can first ascertain if a problem report is original or duplicate, and then provide a list of 20 potential matches for a duplicate report. We address this limitation by developing an automated triaging system that can be used to assist human triagers in bug tracking systems.
Method: Our automated triaging system automatically assigns a label of original or duplicate to each incoming problem report, and provides a list of 20 suggestions for reports classified as duplicate. The system uses 24 document similarity measures and associated summary statistics, along with a suite of document property and user metrics. We perform our research on a lifetime of problem reports from the Eclipse, Firefox and Open Office repositories.
Results: Our system can be used as a filtration aid, with high original recall exceeding 95% and low duplicate recall, or as a triaging guide, with balanced recall of approximately 70% for both originals and duplicates. Furthermore, the system reduces the workload on the triager by over 90%.
Conclusions: Our work represents the first full-scale effort at automatically triaging problem reports in open source repositories. By utilizing multiple similarity measures, we reduce the potential of false matches caused by the diversity of human language.
(C) 2016 Elsevier B.V. All rights reserved.
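The two-stage flow described under Method (label an incoming report as original or duplicate, then suggest up to 20 candidate matches) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single bag-of-words cosine similarity in place of the 24 similarity measures and the additional document and user metrics, and it assumes a fixed cutoff in place of whatever classifier the paper trains. The function names (`tokenize`, `cosine`, `triage`) and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def triage(new_report, repository, k=20, threshold=0.5):
    """Label a report and return up to k suggested matching report ids.

    repository maps report id -> report text. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    query = Counter(tokenize(new_report))
    scored = []
    for report_id, text in repository.items():
        score = cosine(query, Counter(tokenize(text)))
        if score > 0:
            scored.append((score, report_id))
    # Highest similarity first; ties broken by report id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = scored[:k]
    if top and top[0][0] >= threshold:
        return "duplicate", [report_id for _, report_id in top]
    return "original", []


repository = {
    1: "Editor crashes when opening large file",
    2: "Toolbar icons missing after update",
    3: "Crash on opening a large file in editor",
}
label, suggestions = triage("editor crash opening large file", repository)
```

With this toy repository, the incoming report shares most of its terms with reports 3 and 1, so it is labeled a duplicate and those two reports are suggested in that order. A report with no overlapping terms would be labeled original and receive no suggestions.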
Pages: 1-13
Page count: 13