A Supervised Multi-class Multi-label Word Embeddings Approach for Toxic Comment Classification

Cited by: 20

Authors:

Carta, Salvatore [1]

Corriga, Andrea [1]

Mulas, Riccardo [1]

Recupero, Diego [1]

Saia, Roberto [1]

Affiliations:

[1] Univ Cagliari, Dept Math & Comp Sci, Via Osped 72, I-09124 Cagliari, Italy

Source:

KDIR: PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL 1: KDIR | 2019

Keywords:

Apache Spark; Word Embeddings; Sentiment Analysis; Supervised Approach; MODEL;

DOI:

10.5220/0008110901050112

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Internet-based communication has revolutionized the way people exchange information, allowing real-time discussions among a huge number of users. However, the advantages offered by such powerful instruments of communication are sometimes jeopardized by personal attacks that lead many people to leave a discussion in which they were participating. This problem is related to so-called toxic comments, i.e., personal attacks, verbal bullying and, more generally, an aggressive way of participating in a discussion that drives some participants to abandon it. By exploiting the Apache Spark big data framework and several word embeddings, this paper presents an approach that performs a multi-class multi-label classification of a discussion over six classes of toxicity. We evaluate the approach by classifying a dataset of comments taken from Wikipedia talk pages, as defined by a Kaggle challenge. The experimental results show that, through the adoption of different sets of word embeddings, our supervised approach outperforms state-of-the-art approaches that rely on the canonical bag-of-words model. In addition, the adoption of word embeddings defined in a similar scenario (i.e., discussions related to e-learning videos) shows that it is possible to improve performance with respect to solutions employing state-of-the-art word embeddings.
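
The multi-label setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Apache Spark. It assumes a "binary relevance" scheme (one independent binary decision per label) over the average of a comment's word vectors, which is one common way to combine word embeddings with multi-label classification. The label names are assumed from the public Kaggle toxic comment challenge; the embeddings, weights, and function names are hypothetical values invented for illustration.

```python
from typing import Dict, List, Tuple

# Label names assumed from the public Kaggle toxic comment challenge.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy 2-dimensional word embeddings, invented for illustration only.
EMBEDDINGS: Dict[str, List[float]] = {
    "you": [0.0, 0.1],
    "idiot": [0.9, 0.2],
    "thanks": [-0.8, 0.0],
    "kill": [0.7, 0.9],
}

# One (weights, bias) pair per label. These hypothetical numbers stand in
# for six independently trained binary classifiers.
CLASSIFIERS: Dict[str, Tuple[List[float], float]] = {
    "toxic": ([1.0, 0.0], -0.2),
    "severe_toxic": ([1.0, 1.0], -1.2),
    "obscene": ([1.0, 0.0], -0.8),
    "threat": ([0.0, 1.0], -0.4),
    "insult": ([1.0, 0.0], -0.4),
    "identity_hate": ([0.0, 0.0], -1.0),
}


def comment_vector(text: str, dim: int = 2) -> List[float]:
    """Average the embeddings of the known tokens in a comment."""
    vectors = [EMBEDDINGS[t] for t in text.lower().split() if t in EMBEDDINGS]
    if not vectors:
        # No known token: fall back to the zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def predict_labels(text: str) -> List[str]:
    """Return every label whose linear score is positive (zero or more labels)."""
    vec = comment_vector(text)
    predicted = []
    for label in LABELS:
        weights, bias = CLASSIFIERS[label]
        score = sum(w * x for w, x in zip(weights, vec)) + bias
        if score > 0:
            predicted.append(label)
    return predicted


if __name__ == "__main__":
    for comment in ["you idiot", "thanks", "kill you"]:
        print(comment, "->", predict_labels(comment))
```

Because each label is decided independently, a comment can receive several labels at once or none at all, which is what distinguishes the multi-label setting from ordinary single-label multi-class classification.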


Pages: 105-112

Page count: 8

