Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair

Times cited: 0
Authors
Muneer, Iqra [1]
Waheed, Nida [2]
Ashraf, Adnan [3]
Nawab, Rao M. Adeel [2]
Affiliations
[1] Univ Engn & Technol Lahore, Dept Comp Sci & Engn, Narowal Campus, Lahore, Punjab, Pakistan
[2] COMSATS Univ Islamabad, Dept CS & IT, Lahore, Punjab, Pakistan
[3] Ara Inst Canterbury, Dept Software Engn, Christchurch, Canterbury, New Zealand
Source
EUROPEAN JOURNAL ON ARTIFICIAL INTELLIGENCE | 2025 / Volume 38 / Issue 03
Keywords
cross-lingual paraphrase detection; cross-lingual sentence transformer; English-Urdu word pairs; PLAGIARISM DETECTION;
DOI
10.1177/30504554251319446
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to vast digital data collections and paraphrasing tools, researchers have shown growing interest in Cross-lingual Paraphrase Detection (CLPD). Open-access data and tools make paraphrasing easier and detection more challenging. Translation tools further exacerbate the issue by enabling effortless text translation across languages, leading to increased cross-lingual paraphrasing. Most existing CLPD studies focus on European languages, particularly English, while the English-Urdu language pair remains underexplored due to limited standard approaches and benchmark corpora. This study addresses this gap by developing the CLPD Corpus for English-Urdu (CLPD-EU), a gold-standard benchmark corpus at the sentence level. The corpus includes 5,801 sentence pairs, comprising 3,900 paraphrased and 1,901 non-paraphrased instances. Additionally, the study implements classical machine learning methods based on bilingual dictionaries, cross-lingual word embeddings, and transfer learning using sentence transformers. The research further incorporates state-of-the-art Large Language Models (LLMs) such as Mistral and LLaMA, significantly improving detection accuracy. Our proposed Feature Fusion Approach, 'Comb-ST+BD', demonstrates strong performance with an F1 score of 0.739 for the CLPD task. The CLPD-EU corpus will be publicly available to encourage further research in CLPD, especially for under-resourced languages like Urdu.
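As a rough illustration of two ideas the abstract mentions, the sketch below computes a bilingual-dictionary overlap feature for an English-Urdu sentence pair and the F1 score from confusion counts. It is not the authors' implementation: the toy dictionary (romanized Urdu), the function names `dictionary_overlap` and `f1_score`, and the whitespace tokenization are all assumptions made for this example. Only the corpus counts (3,900 paraphrased, 1,901 non-paraphrased, 5,801 total) come from the abstract.

```python
from typing import Dict, Set

# Hypothetical toy English -> Urdu dictionary (romanized); NOT taken from the paper.
TOY_DICT: Dict[str, Set[str]] = {
    "this": {"yeh"},
    "is": {"hai"},
    "good": {"acha", "behtar"},
    "book": {"kitab"},
}


def dictionary_overlap(english: str, urdu: str, bilingual: Dict[str, Set[str]]) -> float:
    """Fraction of English tokens with at least one dictionary translation
    present in the Urdu sentence (assumed whitespace tokenization)."""
    en_tokens = english.lower().split()
    ur_tokens = set(urdu.lower().split())
    if not en_tokens:
        return 0.0
    hits = sum(1 for tok in en_tokens if bilingual.get(tok, set()) & ur_tokens)
    return hits / len(en_tokens)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class balance as reported in the abstract.
PARAPHRASED, NON_PARAPHRASED = 3900, 1901
TOTAL = PARAPHRASED + NON_PARAPHRASED
PARAPHRASED_SHARE = PARAPHRASED / TOTAL  # roughly two thirds of the corpus
```

In a feature-fusion setup of the kind the abstract names, a score like this could be concatenated with a sentence-transformer similarity and passed to a classifier; how 'Comb-ST+BD' actually combines its features is not stated in this record.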
Pages: 309-329
Page count: 21