Methods for cross-language plagiarism detection

Cited by: 48
Authors
Barron-Cedeno, Alberto [1, 2]
Gupta, Parth [3]
Rosso, Paolo [3]
Affiliations
[1] Univ Politecn Cataluna, Talp Res Ctr, E-08028 Barcelona, Spain
[2] Univ Politecn Madrid, Fac Informat, E-28040 Madrid, Spain
[3] Univ Politecn Valencia, NLE Lab ELiRF, Valencia, Spain
Keywords
Automatic plagiarism detection; Cross-language plagiarism; Plagiarism detection architecture; Cross-language similarity; Text re-use analysis; RETRIEVAL;
DOI
10.1016/j.knosys.2013.06.018
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Plagiarism across languages is on the rise for three reasons: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most approaches to automatic cross-language plagiarism detection depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA); the three models differ inherently in nature and in the resources they require. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired. Crown Copyright (C) 2013 Published by Elsevier B.V. All rights reserved.
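The abstract names CL-CNG without giving its details, so the following is only a minimal sketch of the general idea behind character n-gram similarity, not the authors' implementation. The function names, the trigram size (n=3), the text normalization, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and dropping punctuation.

    The normalization and the default n=3 are illustrative assumptions.
    """
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def cl_cng_similarity(text_a, text_b, n=3):
    """Hypothetical helper: compare two texts by their character n-gram profiles."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

Because it compares surface character sequences, a measure of this kind needs no translation resources, but it can only pick up overlap between languages that share a script and some vocabulary.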
Pages: 211-217
Page count: 7