Context-aware and expert data resources for Brazilian Portuguese hate speech detection

Cited by: 0

Authors:

Vargas, Francielle [1,2]

Carvalho, Isabelle [1]

Pardo, Thiago A. S. [1]

Benevenuto, Fabricio [2]

Affiliations:

[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, Brazil

[2] Univ Fed Minas Gerais, Comp Sci Dept, Belo Horizonte, Brazil

Source:

NATURAL LANGUAGE PROCESSING | 2025 / Vol. 31 / Issue 02

Keywords:

hate speech; Brazilian Portuguese; low-resource languages; RELIABILITY; PRAGMATICS;

DOI:

10.1017/nlp.2024.18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper provides data resources for low-resource hate speech detection. Specifically, we introduce two data resources: (i) the HateBR 2.0 corpus, which is composed of 7,000 comments extracted from Brazilian politicians' accounts on Instagram and manually annotated with a binary class (offensive versus non-offensive) and hate speech targets, and which updates the HateBR corpus by replacing highly similar and one-word comments; and (ii) the multilingual offensive lexicon (MOL), which consists of 1,000 explicit and implicit terms and expressions annotated with context information. The lexicon also comprises native-speaker translations and their cultural adaptations in English, Spanish, French, German, and Turkish. Both the corpus and the lexicon were annotated by three different experts and achieved high inter-annotator agreement. Lastly, we implemented baseline experiments on the proposed data resources. The results demonstrate the reliability of the data, outperforming baseline dataset results in Portuguese, and are promising for hate speech detection in different languages.


Pages: 435 / 456

Page count: 22

