BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification

Cited by: 0
Authors
Adnen Mahmoud
Mounir Zrigui
Affiliations
[1] University of Monastir, Research Laboratory in Algebra, Numbers Theory and Intelligent Systems (RLANTIS)
[2] University of Sousse, Higher Institute of Computer Science and Communication Techniques (ISITCom)
Source
Arabian Journal for Science and Engineering | 2021 / Vol. 46
Keywords
Arabic language; Paraphrase detection; Semantic similarity analysis; Global word vector representation; Convolutional neural network; Bidirectional long short-term memory; Recurrent neural networks; Natural language processing;
DOI
Not available
Abstract
Advances in communication technologies have enabled people to deliver more content. As a result, an increasing amount of data is easily disseminated and published on the internet, which has encouraged the practice of paraphrasing. Paraphrasing conceals an original sentence behind alternative expressions of the same meaning, and detecting it consists in identifying the degree of semantic similarity between the original and the rewritten sentence. This is one of the complex tasks of natural language processing and artificial intelligence. Although Arabic is spoken by a large population around the world, its rich grammar and semantics make sentence modeling and similarity computation difficult. In this paper, an Arabic extrinsic paraphrase identification method is proposed. It is based on a Siamese recurrent neural network architecture, chosen for its performance in processing textual sequences of variable size. First, pertinent features are extracted using global word vectors, which rely on a global co-occurrence matrix built from a local context window. Then, a bidirectional long short-term memory (Bi-LSTM) network is introduced to efficiently incorporate long-term dependencies and capture meaningful contextual semantics between words. For paraphrase identification, the cosine measure is used as a merge function to identify the semantic similarity between the resulting source and suspect vectors. To address the lack of free, publicly available Arabic paraphrase datasets, the word2vec algorithm and part-of-speech tagging are combined to generate suspect sentences. To validate the generated dataset, its quality is compared to the SemEval benchmark. Experiments demonstrated the effectiveness of the proposed methods.
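The Siamese idea described in the abstract (one shared encoder applied to both sentences, followed by a cosine merge) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy vector table, the mean-pooling `encode` function (a simple placeholder for the Bi-LSTM encoder), the English tokens, and the name `paraphrase_score` are all assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors. In the paper's setting these would be
# pretrained global word vectors for Arabic; the values here are invented.
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.2],
    "student": [0.9, 0.1, 0.3],
    "pupil": [0.8, 0.2, 0.3],
    "reads": [0.2, 0.9, 0.1],
    "studies": [0.3, 0.8, 0.2],
    "car": [0.0, 0.1, 0.9],
}


def encode(tokens):
    """Shared encoder used for both branches of the Siamese pair.

    Assumption: mean pooling stands in for the Bi-LSTM, which would instead
    read the sequence in both directions and keep word order information.
    """
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        raise ValueError("sentence contains no known tokens")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def paraphrase_score(source_tokens, suspect_tokens):
    """Merge function: cosine of the two encoded sentences."""
    return cosine(encode(source_tokens), encode(suspect_tokens))


if __name__ == "__main__":
    source = ["the", "student", "reads"]
    print(paraphrase_score(source, ["the", "pupil", "studies"]))
    print(paraphrase_score(source, ["the", "car"]))
```

Because both sentences pass through the same `encode` function, the score is symmetric in its arguments, and a sentence compared with itself scores 1. A decision threshold on the score would still have to be chosen on labeled data; the abstract does not state one.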
Pages: 4163-4174
Number of pages: 11