Arabic text classification using deep learning models

Cited by: 153
Authors
Elnagar, Ashraf [1]
Al-Debsi, Ridhwan [1]
Einea, Omar [1]
Affiliations
[1] University of Sharjah, Department of Computer Science, Machine Learning & NLP Research Group, Sharjah, United Arab Emirates
Keywords
Arabic text classification/categorization; Single-label text categorization; Multi-label text categorization; Word embedding; Deep learning; SANAD; NADiA; SENTIMENT ANALYSIS; CATEGORIZATION; IDENTIFICATION; PERFORMANCE; MACHINE; FUTURE;
DOI
10.1016/j.ipm.2019.102121
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification, or categorization, is the process of automatically tagging a textual document with the most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. The multi-label version, in which a document may carry several labels, is more challenging. For the Arabic language, both tasks (especially the latter) become harder still in the absence of large, free, rich, and rational Arabic datasets. Therefore, we introduce new rich and unbiased datasets for both the single-label (SANAD) and the multi-label (NADiA) Arabic text categorization tasks. Both corpora are made freely available to the Arabic computational linguistics research community. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate the effectiveness of such models on SANAD and NADiA. A unique characteristic of our proposed work, compared with existing ones, is that it does not require a pre-processing phase and is fully based on deep learning models. In addition, we studied the impact of using word2vec embedding models to improve the performance of the classification tasks. Our experimental results showed solid performance of all models on the SANAD corpus, with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and a top accuracy of 96.94%, achieved by attention-GRU. As for NADiA, attention-GRU achieved the highest overall accuracy of 88.68% for a maximum subset of 10 categories on the "Masrawy" dataset.
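The distinction the abstract draws between single-label and multi-label categorization can be illustrated with a minimal sketch of the final decision step. This is not the authors' implementation: the category names, the raw scores, and the 0.5 threshold are hypothetical, and the sketch assumes the common convention of a softmax with argmax for the single-label case and independent per-label sigmoids for the multi-label case.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Map one raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_single_label(scores, labels):
    """Single-label: exactly one category, the one with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def predict_multi_label(scores, labels, threshold=0.5):
    """Multi-label: every category whose own probability reaches the threshold."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]


# Hypothetical categories and classifier scores, for illustration only.
labels = ["Politics", "Sports", "Finance"]
scores = [2.0, -1.0, 0.5]

print(predict_single_label(scores, labels))
print(predict_multi_label(scores, labels))
```

With the same scores, the single-label rule returns one category, while the multi-label rule may return several or none. That open-ended label set is one reason the multi-label task is usually considered the harder of the two.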
Pages: 17