Efficient plagiarism detection for large code repositories

Cited by: 72
Authors
Burrows, Steven [1]
Tahaghoghi, S. M. M. [1]
Zobel, Justin [1]
Affiliations
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
Keywords
plagiarism detection; program code similarity; indexing; local alignment
DOI
10.1002/spe.750
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Unauthorized re-use of code by students is a widespread problem in academic institutions, and raises liability issues for industry. Manual plagiarism detection is time-consuming, and current effective plagiarism detection approaches cannot be easily scaled to very large code repositories. While there are practical text-based plagiarism detection systems capable of working with large collections, this is not the case for code-based plagiarism detection. In this paper, we propose techniques for detecting plagiarism in program code using text similarity measures and local alignment. Through detailed empirical evaluation on small and large collections of programs, we show that our approach is highly scalable while maintaining similar levels of effectiveness to that of the popular JPlag and MOSS systems. Copyright (c) 2006 John Wiley & Sons, Ltd.
Pages: 151-175
Page count: 25