Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Citations: 18
Authors
Albalawi, Yahya [1,2,3]
Buckley, Jim [1,3]
Nikolov, Nikola S. [1,3]
Affiliations
[1] Univ Limerick, Dept Comp Sci & Informat Syst, Limerick, Ireland
[2] Univ Taibah, Coll Arts & Sci, Dept Comp & Informat Sci, Al Ula, Saudi Arabia
[3] Univ Limerick, Irish Software Res Ctr, Limerick, Ireland
Funding
Science Foundation Ireland;
Keywords
Deep learning; Health information; Pre-trained word embeddings; Social media; Machine learning; Natural language processing; Twitter; CONVOLUTIONAL NEURAL-NETWORK; SENTIMENT ANALYSIS; IMBALANCED DATA; TWITTER; CLASSIFICATION; COMMUNICATION; ENSEMBLE;
DOI
10.1186/s40537-021-00488-w
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification, in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are typically used as the input layer of deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results indicate that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%; Mazajak CBOW with the same architecture achieved a higher F1 score of 90.8% but a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second.
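As an illustration of the pipeline the abstract describes, the sketch below shows one conventional way to wire pre-trained Arabic word embeddings (e.g. Mazajak's 300-dimensional vectors) into a BLSTM tweet classifier with tf.keras, preceded by two of the 26 evaluated pre-processing steps (diacritic removal and alef normalization). This is a hypothetical sketch, not the authors' published code: the helper names, the 50-token cap and all hyper-parameters are assumptions.

```python
# Hypothetical sketch, not the paper's code: helper names, MAX_LEN and all
# hyper-parameters below are illustrative assumptions.
import re

import numpy as np
from tensorflow.keras import initializers, layers, models


def normalize_arabic(text: str) -> str:
    """Two of the pre-processing steps the paper evaluates:
    stripping diacritics (tashkeel) and unifying alef variants."""
    text = re.sub(r"[\u064B-\u0652]", "", text)             # remove diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # madda/hamza alefs -> bare alef
    return text


MAX_LEN = 50    # assumed cap on tweet length in tokens
EMB_DIM = 300   # Mazajak vectors are 300-dimensional


def build_blstm(vocab_size: int, embedding_matrix: np.ndarray) -> models.Model:
    """BLSTM over a frozen pre-trained embedding layer; sigmoid output for
    the binary task (health-related vs. not health-related)."""
    model = models.Sequential([
        layers.Embedding(
            vocab_size, EMB_DIM,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            trainable=False,  # keep the pre-trained vectors fixed
        ),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

In such a setup, a tokenizer fitted on the normalized tweets would supply `vocab_size`, and `embedding_matrix` would be filled row by row from the pre-trained vectors for each in-vocabulary token; the traditional classifiers in the paper (KNN, SVM, Multinomial NB, Logistic Regression) would instead consume bag-of-words or TF-IDF features.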
Pages: 29