Arabic Fake News Detection in Social Media Context Using Word Embeddings and Pre-trained Transformers

Cited by: 4
Authors
Azzeh, Mohammad [1]
Qusef, Abdallah [1]
Alabboushi, Omar [1]
Affiliations
[1] Princess Sumaya Univ Technol, King Hussain Sch Comp Sci, Amman, Jordan
Keywords
Arabic fake news detection; Natural language processing; BERT; CAMeLBERT; AraBERT; ARBERT; MARBERT; AraELECTRA;
DOI
10.1007/s13369-024-08959-x
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
The rapid spread of fake news in different languages on social platforms has become a global scourge that threatens societal security and governments. Fake news is usually written to deceive readers and convince them that the false information is correct; stopping its spread has therefore become a priority for governments and societies. Building fake news detection models for the Arabic language comes with its own set of challenges and limitations. The main limitations include (1) a lack of annotated data, (2) dialectal variation, where dialects differ significantly in vocabulary, grammar, and syntax, (3) morphological complexity, with complex word formation and root-and-pattern morphology, (4) semantic ambiguity, which makes models fail to accurately discern the intent and context of a given piece of information, (5) cultural context, and (6) diacritics. The objective of this paper is twofold. First, we construct a large corpus of annotated fake news data for the Arabic language, collected from multiple sources to cover different dialects and cultures. Second, we build fake news detection models by placing machine learning classifiers as model heads over fine-tuned large language models trained on Arabic, such as ARBERT, AraBERT, and CAMeLBERT, as well as the popular word embedding technique AraVec. The results show that the text representations produced by the CAMeLBERT transformer are the most accurate, since all classifiers built on them achieve outstanding evaluation results. We also find that the deep learning classifiers built on top of the transformers are generally better than classical machine learning classifiers. Finally, we could not reach a stable conclusion about which classifier works best with each text representation method, because each evaluation measure favors a different model.
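A minimal sketch of the representation-plus-classifier-head pipeline described in the abstract, written in Python. This is an illustration rather than the authors' code: it assumes the Hugging Face `transformers` and `scikit-learn` libraries, uses the public `CAMeL-Lab/bert-base-arabic-camelbert-mix` checkpoint as a stand-in for the CAMeLBERT encoder, and replaces the collected corpus with a two-sentence toy dataset.

```python
# Minimal sketch (not the authors' code): use a pre-trained Arabic BERT to
# produce sentence representations and train a classical classifier head.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_ID = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)
encoder.eval()

def embed(texts):
    """Return one [CLS] vector per input text (shape: n_texts x hidden_size)."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

# Toy labelled examples for illustration only; the paper uses a large
# multi-source annotated Arabic corpus that is not reproduced here.
train_texts = ["خبر موثق من مصدر رسمي", "خبر مزيف عن علاج سحري"]
train_labels = [0, 1]  # 0 = real, 1 = fake

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["خبر جديد للتحقق منه"])))
```

The same feature matrix could also be fed to a small feed-forward network to obtain a deep-learning head of the kind the paper compares against classical classifiers.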
Pages: 923-936
Number of pages: 14