BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification

Cited by: 0
Authors
Adnen Mahmoud
Mounir Zrigui
Affiliations
[1] University of Monastir, Research Laboratory in Algebra, Numbers Theory and Intelligent Systems RLANTIS
[2] University of Sousse, Higher Institute of Computer Science and Communication Techniques ISITCom
Source
Arabian Journal for Science and Engineering | 2021 / Vol. 46
Keywords
Arabic language; Paraphrase detection; Semantic similarity analysis; Global word vector representation; Convolutional neural network; Bidirectional long short-term memory; Recurrent neural networks; Natural language processing
DOI
Not available
Abstract
Advances in communication technologies have enabled people to share more information. As a result, a growing amount of data is easily disseminated and published on the internet, which has encouraged the practice of paraphrasing. Paraphrasing conceals an original sentence behind alternative expressions with the same meaning. Detecting it consists in identifying the degree of semantic similarity between the original and the paraphrased sentence, which is one of the complex tasks of natural language processing and artificial intelligence. Although Arabic is spoken by a large population around the world, its rich grammar and semantics make sentence modeling and similarity computation difficult. In this paper, an Arabic extrinsic paraphrase identification method is proposed. It is based on a Siamese recurrent neural network architecture, chosen for its performance in processing textual sequences of variable length. First, pertinent features are extracted using global word vectors, which rely on a global co-occurrence matrix built from a local context window. Then, a bidirectional long short-term memory network is introduced to incorporate long-term dependencies and capture meaningful contextual semantics between words. For paraphrase identification, the cosine measure is used as a merge function to compute the semantic similarity between the resulting source and suspect vectors. To address the lack of free, publicly available Arabic paraphrase datasets, the word2vec algorithm and part-of-speech tagging are combined to generate suspect sentences. To validate the generated dataset, its quality is compared to the SemEval benchmark. Experiments demonstrated the effectiveness of the proposed methods.
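The cosine merge step described in the abstract can be illustrated with a minimal sketch. It assumes the two Siamese branches have already produced fixed-size source and suspect vectors; the function names, the toy vectors, and the 0.5 decision threshold are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(source_vec: Sequence[float],
                  suspect_vec: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Flag a pair as a paraphrase when its similarity reaches the threshold.

    The threshold is a hypothetical value for illustration only.
    """
    return cosine_similarity(source_vec, suspect_vec) >= threshold


if __name__ == "__main__":
    # Toy vectors standing in for the outputs of the two Bi-LSTM branches.
    source = [0.2, 0.7, 0.1]
    suspect = [0.25, 0.65, 0.05]
    print(cosine_similarity(source, suspect), is_paraphrase(source, suspect))
```

Because cosine similarity depends only on the angle between the two vectors, sentence representations with different magnitudes can still be compared directly.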
Pages: 4163-4174
Page count: 11