Improving hate speech detection using Cross-Lingual Learning

Cited by: 8
Authors
Firmino, Anderson Almeida [1]
Baptista, Claudio de Souza [1]
de Paiva, Anselmo Cardoso [2]
Affiliations
[1] Univ Fed Campina Grande, Rua Aprigio Veloso 882, Campina Grande, PB, Brazil
[2] Univ Fed Maranhao, Ave Portugueses 1966, Sao Luis, MA, Brazil
Keywords
Hate speech detection; Natural language processing; Social media; Cross-Lingual Learning; Deep learning;
DOI
10.1016/j.eswa.2023.121115
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The growth of social media worldwide has brought both social benefits and challenges. One problem we highlight is the proliferation of hate speech on social media. We propose a novel method for detecting hate speech in texts using Cross-Lingual Learning. Our approach applies transfer learning from Pre-Trained Language Models (PTLMs), exploiting the large corpora available in high-resource languages to solve the task in languages with fewer task-specific resources. The proposed methodology comprises four stages: corpora acquisition, PTLM definition, training strategies, and evaluation. We carried out experiments using Pre-Trained Language Models in English, Italian, and Portuguese (BERT and XLM-R) to verify which best suited the proposed method. We used corpora in English (WH) and Italian (Evalita 2018) as the source languages and the OffComBr-2 corpus in Portuguese as the target language. The experimental results showed that the proposed methodology is promising: on the OffComBr-2 corpus, it obtained a state-of-the-art result (F1-measure = 92%).
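The abstract reports its headline result as an F1-measure. As a minimal sketch of how that metric is computed for a binary hate/non-hate classifier, the following Python function derives precision, recall, and F1 from gold and predicted labels. The function name, the label encoding (1 = hateful, 0 = not hateful), and the toy data are illustrative assumptions; the record does not say whether the paper's 92% is a binary, macro, or weighted F1, so this should not be read as the authors' exact evaluation procedure.

```python
def f1_measure(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed here to mean 'hateful')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not taken from OffComBr-2).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_measure(gold, pred)
print(f"F1 = {score:.2f}")
```

With these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and the F1 is 0.75.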
Pages: 13