Examining the significance of high-level programming features in source code author classification

Cited by: 35
Authors
Frantzeskou, Georgia [1]
MacDonell, Stephen [2]
Stamatatos, Efstathios [1]
Gritzalis, Stefanos [1]
Affiliations
[1] Univ Aegean, Dept Informat & Commun Syst Engn, Samos 83200, Greece
[2] Auckland Univ Technol, Sch Comp & Math Sci, Auckland 1020, New Zealand
Keywords
authorship; source code; program features; fraud
DOI
10.1016/j.jss.2007.03.004
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The use of Source Code Author Profiles (SCAP) represents a new, highly accurate approach to source code authorship identification that is, unlike previous methods, language independent. While accuracy is clearly a crucial requirement of any author identification method, in cases of litigation regarding authorship, plagiarism, and so on, there is also a need to know why it is claimed that a piece of code was written by a particular author. What is it about that piece of code that suggests a particular author? What features in the code make one author more likely than another? In this study, we describe a means of identifying the high-level features that contribute to source code authorship identification, using the SCAP method as a tool. A variety of features are considered for Java and Common Lisp, and the importance of each feature in determining authorship is measured through a sequence of experiments in which we remove one feature at a time. The results show that, for these programs, comments, layout features and package-related naming influence classification accuracy, whereas user-defined naming, an obvious programmer-related feature, does not appear to influence accuracy. A comparison is also made between the relative feature contributions in programs written in the two languages. © 2007 Elsevier Inc. All rights reserved.
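The remove-one-feature experiment described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out how SCAP works, so everything here is an assumption made for illustration: the profile is taken to be the set of most frequent byte-level n-grams, similarity is the size of the intersection of two profiles, and the n-gram length, profile size, function names, and the crude regex-based comment stripper are all hypothetical choices, not the authors' implementation.

```python
import re
from collections import Counter


def profile(source: str, n: int = 3, size: int = 500) -> set:
    """Return the `size` most frequent byte-level n-grams of `source` (assumed profile form)."""
    data = source.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def similarity(profile_a: set, profile_b: set) -> int:
    """Score two profiles by the number of n-grams they share."""
    return len(profile_a & profile_b)


def strip_comments(java_source: str) -> str:
    """Crude feature removal: drop /* */ and // comments.

    Illustrative only; it does not handle comment markers inside string literals.
    """
    no_block = re.sub(r"/\*.*?\*/", "", java_source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", no_block)


def classify(unknown: str, author_sources: dict, transform=lambda s: s) -> str:
    """Attribute `unknown` to the author whose profile overlaps it most.

    `transform` is applied to every program first, so passing `strip_comments`
    simulates removing one feature before classification.
    """
    target = profile(transform(unknown))
    scores = {
        author: similarity(profile(transform(src)), target)
        for author, src in author_sources.items()
    }
    return max(scores, key=scores.get)
```

Running `classify` once with the identity transform and once with a feature-removing transform such as `strip_comments`, over a labelled test set, and comparing the two accuracies would give a rough measure of how much that feature contributes to attribution.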
Pages: 447-460
Page count: 14