A survey on the use of topic models when mining software repositories

Cited by: 0
Authors
Tse-Hsun Chen
Stephen W. Thomas
Ahmed E. Hassan
Affiliation
[1] Queen’s University, Software Analysis and Intelligence Lab (SAIL)
Source
Empirical Software Engineering | 2016 / Vol. 21
Keywords
Topic modeling; LDA; LSI; Survey
DOI
Not available
Abstract
Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of the software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. The recent advances of IR, especially statistical topic models, have helped make sense of unstructured data in software repositories even more. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
Pages: 1843-1919
Page count: 76