Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels

Cited by: 3

Authors:

Muneer, Iqra [1,2]

Nawab, Rao Muhammad Adeel [1]

Affiliations:

[1] COMSATS Univ Islamabad, Lahore Campus, Lahore, Pakistan

[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal, Pakistan

Source:

LANGUAGE RESOURCES AND EVALUATION | 2022 / Vol. 56 / Issue 04

Keywords:

Cross-lingual text reuse; English-Urdu language pair; Lexical; Syntactical; Phrasal; Cross-lingual word embedding; Cross-lingual semantic tagger; Cross-lingual sentence transformer; PLAGIARISM DETECTION;

DOI:

10.1007/s10579-022-09613-4

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the attention of the research community because large digital repositories and efficient Machine Translation systems are readily and freely available, which makes it easy to reuse text across languages and very difficult to detect such reuse. In previous studies, the problem of CLTRD for the English-Urdu language pair has been explored at the sentence/passage and document levels, and benchmark corpora and methods have been developed. However, there is a lack of benchmark corpora and methods for CLTRD for the English-Urdu language pair at the lexical, syntactical, and phrasal levels. To fill this research gap, this study presents three large benchmark corpora for detecting Cross-Lingual Text Reuse (CLTR) at three levels of rewrite: Wholly Derived (WD), Partially Derived (PD), and Non Derived (ND). The CLEU-Lex, CLEU-Syn, and CLEU-Phr corpora contain 66,485 (WD = 22,236, PD = 20,315, and ND = 23,934), 60,267 (WD = 20,007, PD = 16,979, and ND = 23,281), and 60,106 (WD = 23,862, PD = 15,878, and ND = 20,366) CLTR pairs, respectively. As a second major contribution, we have applied Cross-Lingual Word Embedding (CLWE), Cross-Lingual Semantic Tagger (CLST), and Cross-Lingual Sentence Transformer (CLSTR) based methods to our three proposed corpora for CLTRD. Our extensive experimentation showed that for the binary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer (F-1 = 0.80). For the CLEU-Syn and CLEU-Phr corpora, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (F-1 = 0.92 on CLEU-Syn and F-1 = 0.94 on CLEU-Phr). For the ternary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer method (F-1 = 0.69). For the CLEU-Syn corpus, the best results were obtained using a combination of the CLWE, CLST, and CLSTR methods (F-1 = 0.82). For the CLEU-Phr corpus, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods (F-1 = 0.78). To foster and promote research in Urdu (a low-resourced language), all three proposed corpora are free and publicly available for research purposes.
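To make the binary and ternary tasks concrete, the following is a minimal sketch of how a similarity score for an English-Urdu pair could be mapped to the WD/PD/ND labels and scored with F-1. It is not the authors' pipeline. The toy vectors stand in for embeddings that a cross-lingual encoder would produce, and the thresholds, the merging of WD and PD into a single "Derived" class for the binary task, and the use of macro-averaged F-1 are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ternary_label(score, pd_threshold=0.5, wd_threshold=0.8):
    """Map a similarity score to WD / PD / ND (thresholds are hypothetical)."""
    if score >= wd_threshold:
        return "WD"
    if score >= pd_threshold:
        return "PD"
    return "ND"


def binary_label(score, pd_threshold=0.5):
    """Binary task, assuming WD and PD are merged into one 'Derived' class."""
    return "Derived" if score >= pd_threshold else "ND"


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F-1 over all labels seen in either list."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy (source vector, target vector, gold label) triples; purely illustrative.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], "WD"),
    ([1.0, 0.0], [1.0, 1.0], "PD"),
    ([1.0, 0.0], [0.0, 1.0], "ND"),
]

gold = [label for _, _, label in toy_pairs]
predicted = [ternary_label(cosine(src, tgt)) for src, tgt, _ in toy_pairs]
print("ternary macro F-1:", macro_f1(gold, predicted))
```

In practice the thresholds would be tuned on held-out data, or the similarity scores would be passed as features to a trained classifier, rather than being fixed by hand as they are here.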

Pages: 1103-1130

Number of pages: 28

5 records in total

[1] Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels
Iqra Muneer
Rao Muhammad Adeel Nawab
Language Resources and Evaluation, 2022, 56 : 1103 - 1130
[2] Cross-Lingual Text Reuse Detection at sentence level for English-Urdu language pair
Muneer, Iqra
Nawab, Rao Muhammad Adeel
COMPUTER SPEECH AND LANGUAGE, 2022, 75
[3] Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair
Sharjeel, Muhammad
Muneer, Iqra
Nosheen, Sumaira
Nawab, Rao Muhammad Adeel
Rayson, Paul
ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
[4] Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair
Muneer, Iqra
Nawab, Rao Muhammad Adeel
ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
[5] Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair
Muneer, Iqra
Waheed, Nida
Ashraf, Adnan
Nawab, Rao M. Adeel
EUROPEAN JOURNAL ON ARTIFICIAL INTELLIGENCE, 2025,
