Constructing an Academic Thai Plagiarism Corpus for Benchmarking Plagiarism Detection Systems

Cited by: 2
Authors
Taerungruang, Supawat [1]
Aroonmanakun, Wirote [2]
Affiliations
[1] Chulalongkorn Univ, Fac Arts, Linguist, Bangkok, Thailand
[2] Chulalongkorn Univ, Fac Arts, Dept Linguist, Bangkok, Thailand
Source
GEMA ONLINE JOURNAL OF LANGUAGE STUDIES | 2018 / Vol. 18 / No. 03
Keywords
plagiarism; Thai plagiarism detection; corpus creation; language resources; natural language processing
DOI
10.17576/gema-2018-1803-11
Chinese Library Classification number
H [Language, Linguistics]
Discipline classification code
05
Abstract
Plagiarism is a major problem in the academic world. It not only undermines the credibility of educational institutions but also disrupts the creation of knowledge in the academic community. To lessen this problem, many plagiarism detection systems have been developed to detect plagiarized text in academic works. In this paper, we describe the design and construction of an academic Thai plagiarism corpus. Such a corpus is necessary for training and testing plagiarism detection systems for Thai. To make the corpus a comprehensive representation of plagiarism, the data are divided into types according to the degree of the linguistic mechanisms used in plagiarism. The data in the corpus were obtained by two main methods: texts created manually by participants and texts generated automatically by a program. After the corpus was created, its validity was verified with three measurements: the similarity between suspicious texts at the character level, the similarity between suspicious texts at the word level, and a comparison of the different types of data in the corpus based on the measured similarity. The results of the analyses indicate that the corpus created by the proposed methods is effective for training and testing plagiarism detection systems.
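The character-level and word-level similarity checks mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the measure the authors used, so the Dice coefficient over n-gram and token multisets is an assumption here, as are the function names and the choice of trigrams. Thai is written without spaces between words, so the word-level function assumes the text has already been segmented into tokens by some upstream tool.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a multiset of overlapping character n-grams (n=3 is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice(a, b):
    """Dice coefficient between two multisets: 2 * |overlap| / (|a| + |b|)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # both inputs empty: treat as no measurable similarity
    overlap = sum((a & b).values())  # multiset intersection keeps minimum counts
    return 2 * overlap / total


def char_similarity(source, suspicious, n=3):
    """Character-level similarity between a source text and a suspicious text."""
    return dice(char_ngrams(source, n), char_ngrams(suspicious, n))


def word_similarity(source_tokens, suspicious_tokens):
    """Word-level similarity; inputs are lists of already-segmented tokens."""
    return dice(Counter(source_tokens), Counter(suspicious_tokens))
```

Under these assumptions, a verbatim copy scores 1.0 at both levels, while a heavily paraphrased passage would be expected to score lower at the word level than a lightly edited one, which is the kind of comparison across data types the abstract describes.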
Pages: 186-202
Page count: 17