IRISH: A Hidden Markov Model to detect coded information islands in free text

Cited by: 4
Authors
Cerulo, Luigi [1 ,3 ]
Di Penta, Massimiliano [2 ]
Bacchelli, Alberto [4 ]
Ceccarelli, Michele [1 ,5 ]
Canfora, Gerardo [2 ]
Affiliations
[1] Univ Sannio, Dept Sci & Technol, Benevento, Italy
[2] Univ Sannio, Dept Engn, Benevento, Italy
[3] Inst Genet Res Gaetano Salvatore, BioGeM, Ariano Irpino, AV, Italy
[4] Delft Univ Technol, Dept Software Technol, NL-2600 AA Delft, Netherlands
[5] QCRI Qatar Comp Res Inst, Doha, Qatar
Keywords
Hidden Markov Models; Mining unstructured data; Developers' communication
DOI
10.1016/j.scico.2014.11.017
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Developers' communication, as contained in emails, issue trackers, and forums, is a valuable source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers' communication can be useful to support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts. We conduct an extensive evaluation of IRISH (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, and compare it with state-of-the-art approaches based on island parsing or on island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of IRISH between 74% and 99%; this is in line with existing approaches which, unlike IRISH, require specific expertise for the definition of regular expressions or grammars. © 2014 Elsevier B.V. All rights reserved.
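To make the idea of token-level labelling with a Hidden Markov Model concrete, the following is a minimal sketch and not the IRISH implementation. It assumes a two-state model ("text" vs. "code"), a hypothetical three-way token classifier (`token_class`), and hand-picked placeholder probabilities; none of these names or numbers come from the paper. Viterbi decoding then assigns one hidden state to each token.

```python
import math

STATES = ("text", "code")

# Placeholder parameters chosen for illustration only (assumption, not from the paper).
START = {"text": 0.8, "code": 0.2}
TRANS = {
    "text": {"text": 0.9, "code": 0.1},
    "code": {"text": 0.1, "code": 0.9},
}
EMIT = {
    "text": {"word": 0.85, "symbol": 0.03, "ident": 0.12},
    "code": {"word": 0.20, "symbol": 0.45, "ident": 0.35},
}


def token_class(tok):
    """Hypothetical coarse observation alphabet: word, symbol, or identifier-like."""
    if not any(c.isalnum() for c in tok):
        return "symbol"            # punctuation and operators such as "{" or ";"
    if tok.isalpha() and (tok.islower() or tok.istitle()):
        return "word"              # plain lowercase or Capitalized words
    return "ident"                 # camelCase, snake_case, digits, mixed tokens


def label_tokens(tokens):
    """Viterbi decoding in log space: most likely state sequence for the tokens."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []                      # back[i][s] = best predecessor of state s at step i+1
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):     # follow back-pointers from the last token to the first
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    tokens = "please call parseInput ( buf ) ; again".split()
    for tok, state in zip(tokens, label_tokens(tokens)):
        print(f"{tok}\t{state}")
```

A model of this kind would normally have its transition and emission probabilities estimated from labelled examples rather than written by hand; the fixed tables above only keep the sketch short.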
Pages: 26-43
Number of pages: 18