Configuring latent Dirichlet allocation based feature location

Cited by: 58
Authors
Biggers, Lauren R. [1]
Bocovich, Cecylia [2]
Capshaw, Riley [3]
Eddy, Brian P. [1]
Etzkorn, Letha H. [4]
Kraft, Nicholas A. [1]
Affiliations
[1] Univ Alabama, Dept Comp Sci, Tuscaloosa, AL 35487 USA
[2] Macalester Coll, Dept Math Stat & Comp Sci, St Paul, MN 55105 USA
[3] Hendrix Coll, Dept Math & Comp Sci, Conway, AR USA
[4] Univ Alabama, Dept Comp Sci, Huntsville, AL 35899 USA
Funding
National Science Foundation (USA);
Keywords
Software evolution; Program comprehension; Feature location; Static analysis; Text retrieval; CODE; RETRIEVAL; COHESION;
DOI
10.1007/s10664-012-9224-x
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
Pages: 465 - 500
Page count: 36