Fake news detection in Urdu language using machine learning

Cited by: 14
Authors
Farooq, Muhammad Shoaib [1]
Naseem, Ansar [1]
Rustam, Furqan [2]
Ashraf, Imran [3]
Affiliations
[1] Univ Management & Technol, Dept Comp Sci, Lahore, Pakistan
[2] Univ Management & Technol, Dept Software Engn, Lahore, Pakistan
[3] Yeungnam Univ, Informat & Commun Engn, Gyongsan, South Korea
Keywords
Fake news detection; Ensemble learning; Machine learning; Urdu fake news
DOI
10.7717/peerj-cs.1353
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rise of social media, the dissemination of forged content and news has increased. Consequently, fake news detection has emerged as an important research problem. Several approaches have been presented to discriminate fake news from real news; however, such approaches lack robustness for multi-domain datasets, especially within the context of Urdu news. In addition, some studies rely on datasets machine-translated from English to Urdu with Google Translate, without manual verification, which limits the wider use of such approaches in real-world applications. This study investigates these issues and proposes a fake news classifier for Urdu news. The collected dataset covers nine different domains and comprises 4097 news items. Experiments are performed using term frequency-inverse document frequency (TF-IDF) and bag of words (BoW) features in combination with n-grams. The major contribution of this study is the use of feature stacking, where feature vectors of the preprocessed text and of the verbs extracted from the preprocessed text are combined. Support vector machine, k-nearest neighbor, and ensemble models like random forest (RF) and extra tree (ET) were used for bagging, while stacking was applied with ET and RF as base learners and logistic regression as the meta learner. To check the robustness of the models, fivefold cross-validation and independent set testing were employed. Experimental results indicate that stacking achieves 93.39%, 88.96%, 96.33%, 86.2%, and 93.17% scores for accuracy, specificity, sensitivity, MCC, ROC, and F1 score, respectively.
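The feature stacking idea mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`ngrams`, `fit_vocabulary`, `bow_vector`, `stack_features`) are hypothetical, the tokens are toy English placeholders rather than Urdu news, raw counts stand in for TF-IDF weights, and the verbs are assumed to have been extracted already by some part-of-speech tagger that is not shown.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def fit_vocabulary(docs, n_max=2):
    """Map every n-gram seen in the token lists to a column index."""
    terms = sorted({g for tokens in docs for g in ngrams(tokens, n_max)})
    return {term: idx for idx, term in enumerate(terms)}


def bow_vector(tokens, vocab, n_max=2):
    """Build a count-based bag-of-words vector over a fixed vocabulary."""
    counts = Counter(ngrams(tokens, n_max))
    vec = [0] * len(vocab)
    for term, idx in vocab.items():
        vec[idx] = counts.get(term, 0)
    return vec


def stack_features(text_tokens, verb_tokens, text_vocab, verb_vocab):
    """Concatenate the full-text vector with the verbs-only vector."""
    text_vec = bow_vector(text_tokens, text_vocab, n_max=2)
    verb_vec = bow_vector(verb_tokens, verb_vocab, n_max=1)
    return text_vec + verb_vec


# Toy, already-tokenized documents (assumption: preprocessing is done).
texts = [["govt", "announces", "new", "policy"],
         ["actor", "denies", "fake", "claim"]]
# Assumption: verbs were extracted beforehand by a POS tagger.
verbs = [["announces"], ["denies"]]

text_vocab = fit_vocabulary(texts, n_max=2)   # unigrams + bigrams
verb_vocab = fit_vocabulary(verbs, n_max=1)   # verb unigrams only

stacked = [stack_features(t, v, text_vocab, verb_vocab)
           for t, v in zip(texts, verbs)]
```

Each row of `stacked` has one block of columns for the n-grams of the full text followed by one block for the verbs, so a downstream classifier sees both views of the same document in a single vector.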
Pages: 20