A survey on the use of topic models when mining software repositories

Cited by: 118
Authors
Chen, Tse-Hsun [1 ]
Thomas, Stephen W. [1 ]
Hassan, Ahmed E. [1 ]
Affiliations
[1] Queens Univ, SAIL, Kingston, ON, Canada
Keywords
Topic modeling; LDA; LSI; Survey; INFORMATION-RETRIEVAL; FEATURE LOCATION; PROBABILISTIC RANKING; TRACEABILITY LINKS; EXECUTION; PREDICTION; COHESION; SYSTEM; COMBINATION; METRICS;
DOI
10.1007/s10664-015-9402-8
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. Recent advances in IR, especially statistical topic models, have further helped make sense of unstructured data in software repositories. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how best to apply topic models to a particular software engineering task.
Pages: 1843-1919
Page count: 77