Sentence Embedding and Convolutional Neural Network for Semantic Textual Similarity Detection in Arabic Language

Cited by: 21
Authors
Mahmoud, Adnen [1, 2]
Zrigui, Mounir [1]
Affiliations
[1] Univ Monastir, Algebra Numbers Theory & Nonlinear Anal Lab LATNA, Monastir, Tunisia
[2] Univ Sousse, Higher Inst Comp Sci & Commun Tech, Hammam Sousse, Sousse, Tunisia
Keywords
Arabic language; Paraphrase detection; Semantic similarity analysis; Sentence vector representation; Convolutional neural network; Natural language processing; SYSTEM;
DOI
10.1007/s13369-019-04039-7
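Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes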
07 ; 0710 ; 09 ;
Abstract
The continuous growth of textual sources on the web has made paraphrasing easier. Detecting it has become a challenge in various natural language processing applications (e.g., plagiarism detection, information retrieval and extraction, and question answering). In contrast to Western languages such as English, few works have addressed the problem of extrinsic paraphrase detection in Arabic. In this context, we proposed a deep learning-based approach to indicate to what extent original and suspect documents express the same meaning. First, the word2vec algorithm extracted relevant features by predicting the neighbors of each word. The resulting word vectors were then averaged to generate sentence vector representations. Next, a convolutional neural network captured more contextual information and computed the degree of semantic relatedness. Given the lack of publicly available resources, a paraphrased corpus was developed using the skip-gram model, which performed better at replacing an original word with its most similar word of the same grammatical class from the vocabulary. Finally, the proposed system achieved good results in detecting contextual relationships between Arabic documents, with a precision of 85% and a recall of 86.8%, outperforming previous studies.
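The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors, the English placeholder tokens, and the helper names `sentence_vector` and `cosine_similarity` are all hypothetical, and plain cosine similarity stands in for the convolutional neural network that the paper uses to score semantic relatedness.

```python
import numpy as np

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
TOY_VECTORS = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "sits": np.array([0.0, 1.0, 0.0]),
    "kitten": np.array([0.9, 0.1, 0.0]),
    "rests": np.array([0.1, 0.9, 0.0]),
    "stocks": np.array([0.0, 0.0, 1.0]),
    "fell": np.array([0.0, 0.1, 0.9]),
}


def sentence_vector(tokens, vectors, dim=3):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0


original = sentence_vector(["cat", "sits"], TOY_VECTORS)
paraphrase = sentence_vector(["kitten", "rests"], TOY_VECTORS)
unrelated = sentence_vector(["stocks", "fell"], TOY_VECTORS)

# The paraphrase pair should score higher than the unrelated pair.
print(cosine_similarity(original, paraphrase))
print(cosine_similarity(original, unrelated))
```

Averaging discards word order, which is one reason a further model such as the CNN mentioned in the abstract may be applied on top of the sentence vectors.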
Pages: 9263-9274
Page count: 12