Context-aware and expert data resources for Brazilian Portuguese hate speech detection

Authors
Vargas, Francielle [1, 2]
Carvalho, Isabelle [1]
Pardo, Thiago A. S. [1]
Benevenuto, Fabricio [2]
Affiliations
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, Brazil
[2] Univ Fed Minas Gerais, Comp Sci Dept, Belo Horizonte, Brazil
Source
NATURAL LANGUAGE PROCESSING | 2025 / Volume 31 / Issue 02
Keywords
hate speech; Brazilian Portuguese; low-resource languages; RELIABILITY; PRAGMATICS;
DOI
10.1017/nlp.2024.18
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper provides data resources for low-resource hate speech detection. Specifically, we introduce two data resources: (i) the HateBR 2.0 corpus, which comprises 7,000 comments extracted from Brazilian politicians' accounts on Instagram and manually annotated with a binary class (offensive versus non-offensive) and hate speech targets; it is an updated version of the HateBR corpus in which highly similar and one-word comments were replaced; and (ii) the multilingual offensive lexicon (MOL), which consists of 1,000 explicit and implicit terms and expressions annotated with context information. The lexicon also comprises native-speaker translations and their cultural adaptations in English, Spanish, French, German, and Turkish. Both the corpus and the lexicon were annotated by three different experts and achieved high inter-annotator agreement. Lastly, we implemented baseline experiments on the proposed data resources. The results demonstrate the reliability of the data, which outperform baseline dataset results in Portuguese, and are promising for hate speech detection in different languages.
Pages: 435-456
Page count: 22