Sherlock N-overlap: Invasive Normalization and Overlap Coefficient for the Similarity Analysis Between Source Code

Cited by: 20
Authors
Allyson, Franca B. [1]
Danilo, Maciel L. [2]
Jose, Soares M. [2]
Giovanni, Barroso C. [3]
Affiliations
[1] Inst Fed Educ Ciencia & Tecnol Ceara, Dept Teleinformat, Caninde, CE, Brazil
[2] Univ Fed Ceara, Dept Teleinformat Engn, Fortaleza, CE, Brazil
[3] Univ Fed Ceara, Dept Phys, Fortaleza, CE, Brazil
Keywords
Source code similarity detection; similarity investigation tool; data preprocessing; normalization; method of conformity; PLAGIARISM;
DOI
10.1109/TC.2018.2881449
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Some similarity detection tools, such as Sherlock, compare textual documents of any kind but are limited when comparing source code files. The presence or absence of blank spaces between structural elements, the choice of variable names, and other such changes interfere with the similarity index obtained. This paper shows that preprocessing the source code improves Sherlock's performance. The results are based on experiments conducted with 66 previously plagiarized source code files and a base of 2160 programs written by engineering students in programming classes. For this second set, the similarity status was not known in advance, so a method was created to calculate precision and recall in a relative way, using a set of reference tools as a kind of oracle. Our approach, called Sherlock N-overlap, obtained, in most of the cases tested, similarity indexes superior to those of other complex tools such as MOSS, JPlag and SIM.
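To make the two ideas in the title concrete, the following is a minimal sketch of whitespace-insensitive normalization followed by an overlap coefficient, |A ∩ B| / min(|A|, |B|), computed over character n-grams. It is not the authors' implementation: the function names, the choice of character trigrams, and the lowercasing step are assumptions made only for illustration, and the abstract does not specify which normalization rules Sherlock N-overlap actually applies.

```python
import re


def normalize(source: str) -> str:
    """Hypothetical normalization: remove all whitespace and lowercase.

    The paper's actual preprocessing rules are not given in the abstract;
    this only illustrates why spacing differences should not matter.
    """
    return re.sub(r"\s+", "", source).lower()


def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of text (n=3 is an assumption)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def similarity(src_a: str, src_b: str, n: int = 3) -> float:
    """Normalize both sources, then compare their n-gram sets."""
    return overlap_coefficient(
        char_ngrams(normalize(src_a), n),
        char_ngrams(normalize(src_b), n),
    )
```

Under these assumptions, two snippets that differ only in spacing, such as `"int x = 1;"` and `"int  x=1 ;"`, normalize to the same string and therefore receive the maximum score. Because the denominator is the smaller set, a file fully contained in a larger one also scores 1.0, which is the property that distinguishes the overlap coefficient from the Jaccard index.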
Pages: 740-751
Page count: 12