ArabSis: Arabic Corpus Sentiment Analysis

Cited by: 1
Authors
Doughan, Ziad [1 ]
Itani, Sari [1 ]
Itani, Samir [2 ]
Affiliations
[1] Beirut Arab Univ, Fac Engn, Dept Elect & Comp Engn, Beirut 11502, Lebanon
[2] Beirut Arab Univ, Fac Human Sci, Dept Arab Language & Literature, Beirut 11502, Lebanon
Keywords
Sentiment analysis; Natural language processing; Machine learning; Data models; Analytical models; Computational modeling; Deep learning; Context modeling; Tokenization; Emotion recognition; Arabic NLP; artificial intelligence; ensemble methods
DOI
10.1109/ACCESS.2025.3567755
CLC Classification
TP [Automation and Computer Technology]
Subject Classification
0812
Abstract
Despite rapid progress in natural language processing (NLP), the development of specialized resources for niche domains, which is critical for applications such as affective computing and emotionally intelligent AI, remains a persistent challenge. While benchmark datasets abound for general tasks, languages like Arabic and fields like multi-dimensional sentiment analysis (beyond binary positive/negative classification) suffer from resource scarcity, limiting progress in human-centric applications. To address this gap, we present ArabSis: a novel Arabic corpus for multi-dimensional sentiment analysis across five categorical emotions (Joy, Sadness, Fear, Liking, Hatred). Our work introduces a reproducible framework for creating specialized corpora in low-resource languages, enabling future research in regression-based dimensional sentiment analysis and other specialized NLP applications. The ArabSis corpus, developed through systematic data augmentation and human labeling, facilitates advanced analysis using traditional NLP techniques (TF-IDF, Bag of Words) and modern deep learning approaches. It also targets standard Arabic, whereas previous research treats Arabic uniformly regardless of dialect, leaving small nuances and inconsistencies among dialects unnoticed and uncorrected. We evaluate machine learning (ML) and deep learning (DL) models in one-vs-all classification tasks, demonstrating that ML models (e.g., SVMs, Random Forests) outperform DL counterparts on smaller datasets. An ensemble method combining top-performing models achieves 98.6% accuracy through score averaging and majority voting, while also revealing inherent biases in ensemble voting mechanisms. The study provides a comprehensive pipeline encompassing data preprocessing, exploratory analysis, and model training, validated through 5-fold cross-validation, establishing a blueprint for developing specialized NLP resources, particularly for under-resourced languages.
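The two ensemble schemes named in the abstract, score averaging and majority voting over one-vs-all classifiers, can be sketched as follows. This is an illustrative sketch only, not the authors' code: the five emotion labels come from the abstract, while the helper names and the example model outputs are hypothetical.

```python
# Illustrative sketch (not the authors' implementation): combining
# one-vs-all classifier outputs for the five ArabSis emotions via the
# two ensemble schemes the abstract names.
from collections import Counter

EMOTIONS = ["Joy", "Sadness", "Fear", "Liking", "Hatred"]

def score_averaging(model_scores):
    """model_scores: one dict per base model, mapping emotion -> confidence.
    Returns the emotion with the highest mean score across models."""
    avg = {e: sum(s[e] for s in model_scores) / len(model_scores)
           for e in EMOTIONS}
    return max(avg, key=avg.get)

def majority_voting(model_predictions):
    """model_predictions: one hard label per base model.
    Ties break by EMOTIONS order -- the kind of built-in bias the
    abstract says voting mechanisms can exhibit."""
    counts = Counter(model_predictions)
    best = max(counts.values())
    return next(e for e in EMOTIONS if counts.get(e, 0) == best)

# Hypothetical outputs from three base models (e.g., SVM, Random Forest, DL)
scores = [
    {"Joy": 0.7, "Sadness": 0.1, "Fear": 0.1, "Liking": 0.6, "Hatred": 0.0},
    {"Joy": 0.4, "Sadness": 0.2, "Fear": 0.1, "Liking": 0.8, "Hatred": 0.1},
    {"Joy": 0.3, "Sadness": 0.1, "Fear": 0.2, "Liking": 0.9, "Hatred": 0.0},
]
print(score_averaging(scores))                       # -> Liking
print(majority_voting(["Joy", "Liking", "Liking"]))  # -> Liking
```

The tie-breaking rule makes the voting bias concrete: when every model disagrees, the label earliest in the fixed emotion list always wins, so the ensemble is not neutral among classes.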
Pages: 81083-81095 (13 pages)