Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges

Cited by: 30
Authors
Vani, K. [1]
Gupta, Deepa [2]
Affiliations
[1] Amrita Univ, Amrita Vishwa Vidyapeetham, Amrita Sch Engn, Dept Comp Sci & Engn, Bengaluru, India
[2] Amrita Univ, Amrita Vishwa Vidyapeetham, Amrita Sch Engn, Dept Math, Bengaluru, India
Keywords
Natural language processing; Plagiarism detection; Syntactic-semantic; POS tagging; Chunking; Semantic role labelling
DOI
10.1016/j.ipm.2018.01.008
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This work explores and compares the effectiveness of syntactic-semantic linguistic structures for plagiarism detection using natural language processing techniques. It examines three kinds of linguistic features, namely part-of-speech tags, chunks, and semantic roles, for detecting plagiarized fragments, and uses a combined syntactic-semantic similarity metric that draws semantic concepts from the WordNet lexical database. The linguistic information is used both for effective pre-processing and for making semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels: the impact of plagiarism type and complexity level on the extracted features is analyzed and discussed. Further, unlike existing systems, which were evaluated on limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpora provided by the PAN competition from 2009 to 2014. The approach showed considerable improvement over the top-ranked systems of the respective years. The evaluation and analysis across the various plagiarism cases also indicated the advantage of deeper linguistic features for identifying manually plagiarized data.
Pages: 408-432
Page count: 25