Reuse and plagiarism in Speech and Natural Language Processing publications

Cited by: 5
Authors
Mariani, Joseph [1]
Francopoulo, Gil [1, 2]
Paroubek, Patrick [1]
Affiliations
[1] Univ Paris Saclay, CNRS, LIMSI, Orsay, France
[2] Tagmatica, Paris, France
Keywords
Plagiarism; Detection; Text reuse; Natural Language Processing; Speech Processing; Scientometrics; Informetrics
DOI
10.1007/s00799-017-0211-0
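Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501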
Abstract
The aim of this experiment is to present an easy way to compare fragments of text in order to detect the (supposed) results of copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space of the comparisons is a corpus labeled NLP4NLP, which includes 34 different conferences and journals and gathers a large part of the NLP activity over the past 50 years. This study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four different types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (which we call the source paper), or, in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is a rather common practice, whereas plagiarism seems to be very unusual, and that both stay within legal and ethical limits.
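The comparison described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the use of word n-grams as the compared fragments, the window size `n`, the `Paper` fields, and in particular the decision rule in `classify`, which assumes that the "self-" categories mean the two papers share an author and that the "plagiarism" categories mean the borrowing paper does not cite the source paper.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record layout, chosen only for this sketch.
    paper_id: str
    authors: frozenset
    text: str
    cited_ids: frozenset = frozenset()


def word_ngrams(text, n=7):
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(borrower, source, n=7):
    """Fraction of the borrower's n-grams that also occur in the source."""
    borrowed = word_ngrams(borrower.text, n)
    if not borrowed:
        return 0.0
    return len(borrowed & word_ngrams(source.text, n)) / len(borrowed)


def classify(borrower, source):
    """Label a borrower/source pair with one of the four relationship types.

    Assumed rule: shared author -> "self-" variant; source not cited ->
    "plagiarism" variant.
    """
    shared_author = bool(borrower.authors & source.authors)
    cites_source = source.paper_id in borrower.cited_ids
    if shared_author:
        return "self-reuse" if cites_source else "self-plagiarism"
    return "reuse" if cites_source else "plagiarism"
```

In a sketch like this, a pair of papers would only be passed to `classify` once `overlap_ratio` exceeds some chosen threshold; the threshold value is likewise a free parameter here, not one reported in the abstract.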
Pages: 113-126
Page count: 14