Arabic text classification using deep learning models

Cited by: 153

Authors:

Elnagar, Ashraf [1]

Al-Debsi, Ridhwan [1]

Einea, Omar [1]

Affiliations:

[1] Univ Sharjah, Dept Comp Sci, Machine Learning & NLP Res Grp, Sharjah, U Arab Emirates

Source:

INFORMATION PROCESSING & MANAGEMENT | 2020 / Vol. 57 / Issue 01

Keywords:

Arabic text classification/categorization; Single-label text categorization; Multi-label text categorization; Word embedding; Deep learning; SANAD; NADiA; SENTIMENT ANALYSIS; CATEGORIZATION; IDENTIFICATION; PERFORMANCE; MACHINE; FUTURE;

DOI:

10.1016/j.ipm.2019.102121

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

Text classification, or categorization, is the process of automatically tagging a textual document with the most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. However, the multi-label version is challenging. For the Arabic language, both tasks (especially the latter) become more challenging in the absence of large, free, rich, and rational Arabic datasets. Therefore, we introduce new rich and unbiased datasets for both the single-label (SANAD) and the multi-label (NADiA) Arabic text categorization tasks. Both corpora are made freely available to the research community on Arabic computational linguistics. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate the effectiveness of such models on SANAD and NADiA. A unique characteristic of our proposed work, when compared to existing ones, is that it does not require a pre-processing phase and is fully based on deep learning models. In addition, we studied the impact of utilizing word2vec embedding models to improve the performance of the classification tasks. Our experimental results showed solid performance of all models on the SANAD corpus, with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and a top performance of 96.94%, achieved by attention-GRU. As for NADiA, attention-GRU achieved the highest overall accuracy of 88.68% for a maximum subset of 10 categories on the "Masrawy" dataset.
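The difference between the single-label and multi-label tasks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the category names, the 0.5 threshold, and the use of exact-match (subset) accuracy are all assumptions made for illustration, and the abstract does not state which accuracy definition was used for NADiA.

```python
from typing import Dict, List, Set


def predict_single_label(scores: Dict[str, float]) -> str:
    """Single-label case: keep only the one highest-scoring category."""
    return max(scores, key=scores.get)


def predict_multi_label(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Multi-label case: keep every category whose score reaches the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return {label for label, score in scores.items() if score >= threshold}


def subset_accuracy(predicted: List[Set[str]], gold: List[Set[str]]) -> float:
    """Fraction of documents whose predicted label set matches the gold set exactly.

    Assumed metric for illustration only.
    """
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)


# Hypothetical per-category scores for one news article.
example_scores = {"Sports": 0.81, "Politics": 0.64, "Finance": 0.12}
single = predict_single_label(example_scores)
multi = predict_multi_label(example_scores)
```

With these made-up scores, the single-label prediction keeps only the top category, while the multi-label prediction keeps every category above the threshold, which is why a multi-label document can be only partly right and exact-match scoring is stricter.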


Pages: 17
