A survey on the use of topic models when mining software repositories

Cited by: 118
Authors
Chen, Tse-Hsun [1 ]
Thomas, Stephen W. [1 ]
Hassan, Ahmed E. [1 ]
Affiliations
[1] Queens Univ, SAIL, Kingston, ON, Canada
Keywords
Topic modeling; LDA; LSI; Survey; Information retrieval; Feature location; Probabilistic ranking; Traceability links; Execution; Prediction; Cohesion; System; Combination; Metrics
DOI
10.1007/s10664-015-9402-8
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. Recent advances in IR, especially statistical topic models, have further helped make sense of unstructured data in software repositories. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
Pages: 1843-1919
Page count: 77