Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels

Cited by: 3
Authors
Muneer, Iqra [1, 2]
Nawab, Rao Muhammad Adeel [1]
Affiliations
[1] COMSATS Univ Islamabad, Lahore Campus, Lahore, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal, Pakistan
Keywords
Cross-lingual text reuse; English-Urdu language pair; Lexical; Syntactical; Phrasal; Cross-lingual word embedding; Cross-lingual semantic tagger; Cross-lingual sentence transformer; Plagiarism detection
DOI
10.1007/s10579-022-09613-4
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the attention of the research community because large digital repositories and efficient Machine Translation systems are readily and freely available, which makes it easier to reuse text across languages and much harder to detect such reuse. In previous studies, the problem of CLTRD for the English-Urdu language pair has been explored at the sentence/passage and document levels, and benchmark corpora and methods have been developed. However, benchmark corpora and methods for CLTRD for the English-Urdu language pair are lacking at the lexical, syntactical, and phrasal levels. To fill this research gap, this study presents three large benchmark corpora for detecting Cross-Lingual Text Reuse (CLTR), each annotated with three levels of rewrite: Wholly Derived (WD), Partially Derived (PD), and Non Derived (ND). The CLEU-Lex, CLEU-Syn, and CLEU-Phr corpora contain 66,485 (WD = 22,236, PD = 20,315, and ND = 23,934), 60,267 (WD = 20,007, PD = 16,979, and ND = 23,281), and 60,106 (WD = 23,862, PD = 15,878, and ND = 20,366) CLTR pairs, respectively. As a second major contribution, we have applied Cross-Lingual Word Embedding (CLWE), Cross-Lingual Semantic Tagger (CLST), and Cross-Lingual Sentence Transformer (CLSTR) based methods to our three proposed corpora for CLTRD. Our extensive experimentation showed that, for the binary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer (F1 = 0.80). For the CLEU-Syn and CLEU-Phr corpora, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (F1 = 0.92 on CLEU-Syn and F1 = 0.94 on CLEU-Phr). For the ternary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer method (F1 = 0.69). For the CLEU-Syn corpus, the best results were obtained using a combination of the CLWE, CLST, and CLSTR methods (F1 = 0.82). For the CLEU-Phr corpus, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (F1 = 0.78). To foster and promote research in Urdu (a low-resource language), all three proposed corpora are freely and publicly available for research purposes.
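The class counts reported in the abstract can be tabulated to check the stated totals and to see how balanced each corpus is. The following is a minimal sketch, not the authors' code: the counts are copied from the abstract, while the names `CORPORA`, `REPORTED_TOTALS`, `to_binary`, and `proportions` are hypothetical. The binary view assumes that WD and PD are merged into a single "Derived" class, which the abstract does not state explicitly.

```python
# Class counts copied from the abstract (number of CLTR pairs per label).
CORPORA = {
    "CLEU-Lex": {"WD": 22236, "PD": 20315, "ND": 23934},
    "CLEU-Syn": {"WD": 20007, "PD": 16979, "ND": 23281},
    "CLEU-Phr": {"WD": 23862, "PD": 15878, "ND": 20366},
}

# Corpus sizes as stated in the abstract.
REPORTED_TOTALS = {"CLEU-Lex": 66485, "CLEU-Syn": 60267, "CLEU-Phr": 60106}


def to_binary(counts):
    """Collapse the three labels into two.

    Assumption (not stated in the abstract): the binary task treats
    WD and PD together as 'Derived' and ND as 'Non-Derived'.
    """
    return {
        "Derived": counts["WD"] + counts["PD"],
        "Non-Derived": counts["ND"],
    }


def proportions(counts):
    """Return each label's share of the corpus as a fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    for name, counts in CORPORA.items():
        total = sum(counts.values())
        # The per-class counts should add up to the reported corpus size.
        assert total == REPORTED_TOTALS[name], name
        ternary = {k: round(v, 3) for k, v in proportions(counts).items()}
        binary = {k: round(v, 3) for k, v in proportions(to_binary(counts)).items()}
        print(name, total, ternary, binary)
```

Under that assumption, the ternary labels are roughly balanced in all three corpora, while the binary view is skewed toward the "Derived" class, which is worth keeping in mind when comparing the binary and ternary F1 scores.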
Pages: 1103-1130
Page count: 28
Related papers
5 items in total
  • [1] Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels
    Iqra Muneer
    Rao Muhammad Adeel Nawab
    Language Resources and Evaluation, 2022, 56 : 1103 - 1130
  • [2] Cross-Lingual Text Reuse Detection at sentence level for English-Urdu language pair
    Muneer, Iqra
    Nawab, Rao Muhammad Adeel
    COMPUTER SPEECH AND LANGUAGE, 2022, 75
  • [3] Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair
    Sharjeel, Muhammad
    Muneer, Iqra
    Nosheen, Sumaira
    Nawab, Rao Muhammad Adeel
    Rayson, Paul
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
  • [4] Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair
    Muneer, Iqra
    Nawab, Rao Muhammad Adeel
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
  • [5] Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair
    Muneer, Iqra
    Waheed, Nida
    Ashraf, Adnan
    Nawab, Rao M. Adeel
    EUROPEAN JOURNAL ON ARTIFICIAL INTELLIGENCE, 2025,