ETHOS: a multi-label hate speech detection dataset

Cited by: 0

Authors:

Ioannis Mollas

Zoe Chrysopoulou

Stamatis Karlos

Grigorios Tsoumakas

Affiliation:

[1] Aristotle University of Thessaloniki, Department of Informatics

Source:

Complex & Intelligent Systems | 2022 / Vol. 8

Keywords:

Hate speech; Dataset; Machine learning; Multi-label; Classification; Active learning; I.2.6; I.2.7; I.5.4; H.2.4

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Online hate speech is a recent problem in our society that is rising at a steady pace, exploiting the vulnerabilities of the regimes that govern most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of users log in every day, and protecting those users from exposure to such phenomena appears necessary both to comply with the corresponding legislation and to maintain a high level of service quality. A robust and reliable system for detecting and preventing the uploading of such content will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, while the erroneous operation of such mechanisms would raise many ethical concerns. In this paper, we present ‘ETHOS’ (multi-labEl haTe speecH detectiOn dataSet), a textual dataset with two variants, binary and multi-label, based on YouTube and Reddit comments and validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined. Our key assumption is that, even by gaining only a small amount of labelled data from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.


Pages: 4663-4678

Page count: 15
