Investigating the use of Lexical Information for Software System Clustering

Cited by: 66
Authors
Corazza, Anna [1]
Di Martino, Sergio [1]
Maggio, Valerio [1]
Scanniello, Giuseppe [2]
Affiliations
[1] Univ Naples Federico II, Sez Informat, Dipartimento Sci Fis, Naples, Italy
[2] Univ Basilicata, Potenza, Italy
Source
2011 15TH EUROPEAN CONFERENCE ON SOFTWARE MAINTENANCE AND REENGINEERING (CSMR) | 2011
Keywords
Software Remodularization; Clustering; Lexical Information; Probabilistic Model;
DOI
10.1109/CSMR.2011.8
Chinese Library Classification number
TP31 [Computer Software];
Subject classification code
081202; 0835;
Abstract
Developers have considerable freedom in writing comments and in choosing identifiers and method names. These choices are intentional and carry information of varying relevance for understanding what a software system implements and, in particular, the role of each source file. In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular, we explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class, attribute, method, and parameter names; comments; and source code statements. Their relevance has been weighted by means of a probabilistic model whose parameters have been estimated by the Expectation-Maximization algorithm. To group source files accordingly, we used a hierarchical clustering algorithm. The investigation has been conducted on a dataset of 13 open source Java software systems.
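The pipeline outlined in the abstract (per-zone vocabularies, zone weighting, hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the zone weights in `ZONE_WEIGHTS` are fixed, hypothetical values rather than parameters estimated by Expectation-Maximization, only three of the six zones are shown, and the similarity measure (cosine), the linkage (average), the stopping threshold, and the sample file names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical zone weights. The paper estimates zone relevance with EM;
# this sketch simply hard-codes illustrative values.
ZONE_WEIGHTS = {"class": 3.0, "method": 2.0, "comment": 1.0}


def weighted_vector(zones):
    """Merge per-zone token lists into one weighted term-frequency vector."""
    vec = Counter()
    for zone, tokens in zones.items():
        weight = ZONE_WEIGHTS.get(zone, 1.0)
        for tok in tokens:
            vec[tok.lower()] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(files, threshold):
    """Average-link agglomerative clustering of source files.

    Merging stops once the most similar pair of clusters falls
    below `threshold` (an assumed cut-off, not from the paper).
    """
    vectors = {name: weighted_vector(z) for name, z in files.items()}
    clusters = [[name] for name in sorted(vectors)]

    def sim(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        best, i, j = max(
            (sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Made-up sample input: tokens already extracted from each zone of each file.
files = {
    "OrderDao.java": {"class": ["order", "dao"], "method": ["save", "order"],
                      "comment": ["persist", "order"]},
    "OrderService.java": {"class": ["order", "service"], "method": ["place", "order"],
                          "comment": ["order", "workflow"]},
    "PngEncoder.java": {"class": ["png", "encoder"], "method": ["encode", "image"],
                        "comment": ["image", "compression"]},
}
print(cluster(files, threshold=0.2))
```

With these sample inputs the two order-related files share the heavily weighted term `order` and end up in one cluster, while the image encoder shares no vocabulary with them and stays on its own.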
Pages: 35-44
Number of pages: 10