Automatic offensive language detection from Twitter data using machine learning and feature selection of metadata

Cited by: 12

Authors:

De Souza, Gabriel Araujo [1]

Da Costa-Abreu, Marjory [2]

Affiliations:

[1] Fed Univ Rio Grande do Norte UFRN, Natal, RN, Brazil

[2] Sheffield Hallam Univ, Sheffield, S Yorkshire, England

Source:

2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2020

Keywords:

Offensive Language Detection; Naive Bayes; Linear SVM; Attribute Selection; Twitter

DOI:

10.1109/ijcnn48605.2020.9207652

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The popularity of social networks has only increased in recent years. In theory, social media was proposed so that we could share our views online, keep in contact with loved ones, or share good moments of life. The reality is less perfect: people share hate speech-related messages, use social media to bully specific individuals, or even create robots whose only goal is to target specific situations or people. Identifying who wrote such text is not easy, and there are several possible ways of doing it, such as using natural language processing or machine learning algorithms that can investigate and perform predictions using the associated metadata. In this work, we present an initial investigation of which machine learning techniques are best for detecting offensive language in tweets. After analysing the current trends in the literature on recent text classification techniques, we selected the Linear SVM and Naive Bayes algorithms for our initial tests. For data preprocessing, we used different attribute selection techniques, which are justified in the literature section. In our experiments, we obtained 92% accuracy and 95% recall in detecting offensive language with Naive Bayes, and 90% accuracy and 92% recall with Linear SVM. To our understanding, these results surpass those in our related literature and are a good indicator of the importance of the data description approach we have used.
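As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of a multinomial Naive Bayes text classifier together with the two reported metrics, accuracy and recall. It is not the authors' implementation: the record does not give their dataset, preprocessing, or attribute selection steps, so the toy token lists, the labels `offensive` and `clean`, and the helper names `train_nb`, `predict_nb`, `accuracy`, and `recall` are all assumptions made for this example, and add-one smoothing is simply a common default.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model from tokenized documents."""
    class_counts = Counter(labels)        # documents per class
    word_counts = defaultdict(Counter)    # token counts per class
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:  # tokens never seen in training are ignored
                # add-one (Laplace) smoothed log likelihood
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive items that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


# Toy data invented for this sketch; it is NOT taken from the paper.
train_docs = [
    ["you", "are", "an", "idiot"],
    ["shut", "up", "idiot"],
    ["have", "a", "great", "day"],
    ["what", "a", "great", "match"],
]
train_labels = ["offensive", "offensive", "clean", "clean"]
model = train_nb(train_docs, train_labels)

test_docs = [["idiot", "shut", "up"], ["great", "day"]]
test_labels = ["offensive", "clean"]
predictions = [predict_nb(model, d) for d in test_docs]

print(predictions)
print("accuracy:", accuracy(test_labels, predictions))
print("recall:", recall(test_labels, predictions, "offensive"))
```

Recall is computed for the offensive class only, which matches how the abstract phrases its results ("recall to detect offensive language"); whether the paper averages recall differently is not stated in this record.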


Pages: 6

