Design and Development of a Plagiarism Corpus in Thai for Plagiarism Detection

Cited by: 0
Authors
Thaiprayoon, Santipong [1]
Palingoon, Pornpimon [1]
Trakultaweekoon, Kanokorn [1]
Affiliations
[1] National Science and Technology Development Agency (NSTDA), National Electronics and Computer Technology Center (NECTEC), Pathum Thani, Thailand
Source
PROCEEDINGS OF 2019 11TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE 2019) | 2019
Keywords
Thai plagiarism corpus; corpus construction; plagiarism detection; obfuscation strategies; natural language processing
DOI
10.1109/kse.2019.8919436
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A main obstacle to creating a plagiarism corpus in Thai is that real cases of plagiarized documents are difficult to acquire because of copyright issues. To address this problem, we present the design and development of a Thai plagiarism corpus for evaluating and comparing plagiarism detection algorithms for Thai. The corpus is built with the simulated plagiarism method, using Thai Wikipedia articles and web page articles as sources. To support this method, we provide a Thai plagiarism annotation tool and a Thai plagiarism guideline that help human annotators plagiarize text passages. The corpus contains simulated cases of plagiarized documents based on four classes of Thai plagiarism and linguistic mechanisms: copy-based change, lexicon-based change, structure-based change, and semantic-based change. The suspicious documents in the corpus are created manually with different obfuscation strategies, which makes them more realistic and challenging. We believe that the corpus will be a valuable contribution to the development, comparison, and evaluation of plagiarism detection algorithms. Moreover, the corpus is free and publicly available for research purposes.
Pages: 376-380
Number of pages: 5