Arabic Fake News Detection in Social Media Context Using Word Embeddings and Pre-trained Transformers

Cited by: 4
Authors
Azzeh, Mohammad [1]
Qusef, Abdallah [1]
Alabboushi, Omar [1]
Affiliations
[1] Princess Sumaya Univ Technol, King Hussain Sch Comp Sci, Amman, Jordan
Keywords
Arabic fake news detection; Natural language processing; BERT; CAMeLBERT; AraBERT; ARBERT; MARBERT; AraELECTRA;
DOI
10.1007/s13369-024-08959-x
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
The rapid spread of fake news in different languages on social platforms has become a global scourge that threatens societal security and governments. Fake news is usually written to deceive readers and convince them that the false information is correct; stopping its spread has therefore become a priority for governments and societies. Building fake news detection models for the Arabic language comes with its own set of challenges and limitations. The main limitations include (1) a lack of annotated data, (2) dialectal variation, where dialects differ significantly in vocabulary, grammar, and syntax, (3) morphological complexity, with complex word formation and root-and-pattern morphology, (4) semantic ambiguity, which makes models fail to accurately discern the intent and context of a given piece of information, (5) cultural context, and (6) diacritics. The objective of this paper is twofold. First, we construct a large corpus of annotated fake news data for the Arabic language, collected from multiple sources to cover different dialects and cultures. Second, we build fake news detection models by placing machine learning classifiers as model heads over fine-tuned large language models trained on Arabic, such as ARBERT, AraBERT, and CAMeLBERT, as well as the popular word embedding technique AraVec. The results show that the text representations produced by the CAMeLBERT transformer are the most accurate, since all classifiers built on them achieve outstanding evaluation results. We also find that the deep learning classifiers built on top of the transformers are generally better than classical machine learning classifiers. Finally, we could not reach a stable conclusion about which classifier works best with each text representation method, because each evaluation measure favors a different model.
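A minimal sketch of the representation-plus-classifier-head pipeline described in the abstract, written in Python. This is an illustration rather than the authors' code: it assumes the Hugging Face `transformers` and `scikit-learn` libraries, uses the public `CAMeL-Lab/bert-base-arabic-camelbert-mix` checkpoint as a stand-in for the CAMeLBERT encoder, and replaces the collected corpus with a two-sentence toy dataset.

```python
# Minimal sketch (not the authors' code): use a pre-trained Arabic BERT to
# produce sentence representations and train a classical classifier head.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_ID = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)
encoder.eval()

def embed(texts):
    """Return one [CLS] vector per input text (shape: n_texts x hidden_size)."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

# Toy labelled examples for illustration only; the paper uses a large
# multi-source annotated Arabic corpus that is not reproduced here.
train_texts = ["خبر موثق من مصدر رسمي", "خبر مزيف عن علاج سحري"]
train_labels = [0, 1]  # 0 = real, 1 = fake

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["خبر جديد للتحقق منه"])))
```

The same feature matrix could also be fed to a small feed-forward network to obtain a deep-learning head of the kind the paper compares against classical classifiers.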
Pages: 923-936
Number of pages: 14