Arabic Fake News Detection in Social Media Context Using Word Embeddings and Pre-trained Transformers

Cited by: 4
Authors
Azzeh, Mohammad [1]
Qusef, Abdallah [1]
Alabboushi, Omar [1]
Affiliations
[1] Princess Sumaya Univ Technol, King Hussain Sch Comp Sci, Amman, Jordan
Keywords
Arabic fake news detection; Natural language processing; BERT; CAMeLBERT; AraBERT; ARBERT; MARBERT; AraELECTRA
DOI
10.1007/s13369-024-08959-x
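Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract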
The rapid spread of fake news in different languages on social platforms has become a global scourge that threatens societal security and governments. Fake news is usually written to deceive readers and convince them that false information is correct; stopping its spread has therefore become a priority for governments and societies. Building fake news detection models for the Arabic language comes with its own set of challenges and limitations. The main limitations include 1) a lack of annotated data, 2) dialectal variation, where dialects can differ significantly in vocabulary, grammar, and syntax, 3) morphological complexity, with complex word formations and root-and-pattern morphology, 4) semantic ambiguity that makes models fail to accurately discern the intent and context of a given piece of information, 5) cultural context, and 6) diacrasy. The objective of this paper is twofold. First, we design a large corpus of annotated fake news data for the Arabic language, collected from multiple sources to include different dialects and cultures. Second, we build fake news detectors by training machine learning models as model heads on top of fine-tuned large language models. These large language models, such as ARBERT, AraBERT, and CAMeLBERT, were trained on the Arabic language and are used alongside the popular word embedding technique AraVec. The results showed that the text representations produced by the CAMeLBERT transformer are the most accurate, because all models built on them achieve outstanding evaluation results. We found that pairing the deep learning classifiers with the transformer is generally better than using classical machine learning classifiers. Finally, we could not reach a stable conclusion about which model works best with each text representation method, because each evaluation measure favors a different model.
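The "model head over text representations" idea in the abstract can be illustrated with a minimal sketch. It assumes the sentence embeddings have already been computed; the three-dimensional vectors below are made-up stand-ins for CAMeLBERT or AraVec outputs, and the logistic-regression head, the function names, and the toy labels are illustrative assumptions rather than the authors' models or data.

```python
import math
import random


def train_logistic_head(features, labels, epochs=200, lr=0.5, seed=0):
    """Fit a binary logistic-regression head with plain SGD.

    `features` are fixed-size vectors (hypothetical text embeddings);
    `labels` are 1 for fake and 0 for real.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-0.01, 0.01) for _ in range(len(features[0]))]
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (fake) if the linear score is non-negative, else 0 (real)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0.0 else 0


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (fake) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up embeddings standing in for transformer outputs (assumption).
TOY_EMBEDDINGS = [
    [0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.3, 0.3],  # labeled fake
    [0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.3, 0.7, 0.3],  # labeled real
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_head(TOY_EMBEDDINGS, TOY_LABELS)
predictions = [predict(weights, bias, x) for x in TOY_EMBEDDINGS]
print(precision_recall_f1(TOY_LABELS, predictions))
```

Reporting precision, recall, and F1 side by side reflects the abstract's closing remark: different evaluation measures can favor different models, so a single metric may not settle which head is best.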
Pages: 923-936
Page count: 14