Hate and offensive speech detection on Arabic social media

Cited by: 42
Authors
Alsafari S. [1, 2]
Sadaoui S. [1]
Mouhoub M. [1]
Affiliations
[1] University of Regina, Regina
[2] University of Jeddah, Jeddah
Source
Online Social Networks and Media | 2020 / Vol. 19
Keywords
Arabic corpus; Data annotation; Data extraction; Deep learning; Feature extraction; Hate speech; Multi-class classification; Social media
DOI
10.1016/j.osnem.2020.100096
Abstract
We are witnessing an increasing proliferation of hate speech on social media targeting individuals for their protected characteristics. Our study aims to devise an effective Arabic hate and offensive speech detection framework to address this serious issue. First, we build a reliable Arabic textual corpus by crawling data from Twitter using four robust extraction strategies that we implement based on four types of hate: religion, ethnicity, nationality, and gender. Next, we label the corpus based on a three-level hierarchical annotation scheme in which we verify the inter-annotator agreement to ensure ground truth at each level. Based on machine and deep learning techniques, we develop numerous two-class, three-class, and six-class classification models that we combine with a variety of feature extraction techniques, such as contextual word embeddings. Finally, we conduct an intensive experiment to assess the performance of the different learned models and to examine the misclassification errors. The performance results are very encouraging compared to prior hate and offensive speech studies carried out on Arabic and other languages. © 2020 Elsevier B.V.
Related papers
28 in total
  • [1] Kumar R., Ojha A.K., Malmasi S., Zampieri M., Benchmarking Aggression Identification in Social Media, Proc. First Work. Trolling, Aggress. Cyberbullying, pp. 1-11, (2018)
  • [2] Basile V., Bosco C., Fersini E., Nozza D., Patti V., Rangel Pardo F.M., Rosso P., Sanguinetti M., SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, Proc. 13th Int. Work. Semant. Eval., Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp. 54-63, (2019)
  • [3] Zampieri M., Malmasi S., Nakov P., Rosenthal S., Farra N., Kumar R., SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), Proc. 13th Int. Work. Semant. Eval., Association for Computational Linguistics, Minneapolis, Minnesota, USA, pp. 75-86, (2019)
  • [4] Williams M.L., Burnap P., Cyberhate on Social Media in the aftermath of Woolwich: A Case Study in Computational Criminology and Big Data, Br. J. Criminol., 56, pp. 211-238, (2016)
  • [5] Nobata C., Tetreault J., Thomas A., Mehdad Y., Chang Y., Abusive language detection in online user content, 25th Int. World Wide Web Conf. WWW 2016, pp. 145-153, (2016)
  • [6] Davidson T., Warmsley D., Macy M., Weber I., Automated hate speech detection and the problem of offensive language, Proc. 11th Int. Conf. Web Soc. Media, ICWSM 2017, pp. 512-515, (2017)
  • [7] Malmasi S., Zampieri M., Detecting Hate Speech in Social Media, Proc. Int. Conf. Recent Adv. Nat. Lang. Process. RANLP 2017, Varna, Bulgaria, pp. 467-472, (2017)
  • [8] Zampieri M., Malmasi S., Nakov P., Rosenthal S., Farra N., Kumar R., Predicting the type and target of offensive posts in social media, NAACL HLT 2019 - 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. - Proc. Conf. 1, pp. 1415-1420, (2019)
  • [9] Mubarak H., Darwish K., Magdy W., Abusive Language Detection on Arabic Social Media, Proc. First Work. Abus. Lang. Online, Association for Computational Linguistics, Vancouver, BC, Canada, pp. 52-56, (2017)
  • [10] Albadi N., Kurdi M., Mishra S., Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere, Proc. 2018 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2018, pp. 69-76, (2018)