From source code identifiers to natural language terms

Cited by: 30
Authors
Carvalho, Nuno Ramos [1]
Almeida, Jose Joao [1]
Henriques, Pedro Rangel [1]
Varanda, Maria Joao [2]
Affiliations
[1] Univ Minho, Dept Informat, P-4710057 Braga, Portugal
[2] Polytech Inst Braganca, P-5300253 Braganca, Portugal
Keywords
Program comprehension; Natural language processing; Identifier splitting; TRACEABILITY LINKS; PROGRAM; COMPREHENSION;
DOI
10.1016/j.jss.2014.10.013
Chinese Library Classification
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white space and commas are not allowed). Programmers also often combine words and abbreviations to devise strings that represent single or multiple domain concepts, in order to increase programming linguistic efficiency (conveying more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces LINGUA::IDSPLITTER, a dictionary-based algorithm for splitting and expanding the strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, as well as a custom dictionary automatically generated from the software's natural language content, which is prone to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. © 2014 Elsevier Inc. All rights reserved.
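The distinction the abstract draws between hard splitting (explicit marks) and dictionary-based splitting with expansion can be illustrated with a minimal Python sketch. It is not the paper's algorithm or the tool's API: the toy dictionaries, the function names, and the "fewest segments wins" rule are all assumptions made for illustration, and the paper's own candidate ranking may differ.

```python
import re

# Hypothetical toy dictionaries; a real tool would load far larger ones,
# including a custom dictionary mined from the project's documentation.
WORDS = {"time", "sort", "file", "name", "user", "list"}
ABBREVIATIONS = {"str": "string", "cmp": "compare", "buf": "buffer", "len": "length"}


def hard_split(identifier):
    """Split on explicit marks: underscores and lower-to-upper case changes."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [part.lower() for part in re.split(r"[_\s]+", spaced) if part]


def soft_split(token):
    """Segment a token that has no explicit marks into dictionary entries.

    Returns a list of (fragment, expansion) pairs using the fewest entries
    (an assumed heuristic), or None if the token cannot be fully covered.
    """
    known = {word: word for word in WORDS}
    known.update(ABBREVIATIONS)
    # best[i] holds the shortest segmentation found for token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in known:
                candidate = best[start] + [(piece, known[piece])]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def split_identifier(identifier):
    """Hard split first, then soft split and expand each resulting token."""
    terms = []
    for token in hard_split(identifier):
        segments = soft_split(token)
        if segments is None:
            # No dictionary cover: keep the token unchanged.
            terms.append((token, token))
        else:
            terms.extend(segments)
    return terms
```

Under these assumptions, `split_identifier("strcmp")` yields the fragments `str` and `cmp` expanded to `string` and `compare`, a case that hard splitting alone cannot handle because the identifier carries no case change or underscore.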
Pages: 117-128
Page count: 12