A Supervised Multi-class Multi-label Word Embeddings Approach for Toxic Comment Classification

Cited by: 20
Authors
Carta, Salvatore [1]
Corriga, Andrea [1]
Mulas, Riccardo [1]
Recupero, Diego [1]
Saia, Roberto [1]
Affiliations
[1] Univ Cagliari, Dept Math & Comp Sci, Via Osped 72, I-09124 Cagliari, Italy
Source
KDIR: PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL 1: KDIR | 2019
Keywords
Apache Spark; Word Embeddings; Sentiment Analysis; Supervised Approach; MODEL;
DOI
10.5220/0008110901050112
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Internet-based communication has revolutionized the way people exchange information, allowing real-time discussions among a huge number of users. However, the advantages offered by such powerful instruments of communication are sometimes jeopardized by personal attacks, which lead many people to leave discussions in which they were participating. This problem is related to so-called toxic comments, i.e., personal attacks, verbal bullying and, more generally, an aggressive way of participating in a discussion that drives some participants to abandon it. By exploiting the Apache Spark big data framework and several word embeddings, this paper presents an approach that performs a multi-class multi-label classification of a discussion over six classes of toxicity. We evaluate the approach by classifying a dataset of comments taken from Wikipedia talk pages, as defined in a Kaggle challenge. The experimental results show that, through the adoption of different sets of word embeddings, our supervised approach outperforms state-of-the-art approaches that rely on the canonical bag-of-words model. In addition, adopting word embeddings defined in a similar scenario (i.e., discussions related to e-learning videos) shows that it is possible to improve performance with respect to solutions employing state-of-the-art word embeddings.
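To make the idea of multi-label classification over word embeddings concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a binary-relevance setup (one independent logistic classifier per label) over averaged word vectors, runs in plain Python rather than on Apache Spark, and uses a hypothetical two-dimensional embedding table and toy comments. The six label names are assumed from the Kaggle challenge and are not stated in the abstract.

```python
import math

# Assumed label names (six toxicity classes); not given in the abstract.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical 2-dimensional word embeddings, for illustration only.
EMBEDDINGS = {
    "idiot": [1.0, 0.0], "stupid": [0.9, 0.1],
    "kill": [0.0, 1.0], "hurt": [0.1, 0.9],
    "thanks": [-1.0, -1.0], "great": [-0.9, -0.8],
}
DIM = 2

def embed(comment):
    """Average the vectors of the known words; zero vector if none is known."""
    vectors = [EMBEDDINGS[w] for w in comment.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, x):
    """Probability that the label applies to feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_binary(features, targets, epochs=500, lr=0.5):
    """Logistic regression for one label, fitted by stochastic gradient descent."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            error = score(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def train_multilabel(comments, label_rows):
    """Binary relevance: train one independent classifier per label."""
    features = [embed(c) for c in comments]
    return [train_binary(features, [row[k] for row in label_rows])
            for k in range(len(LABELS))]

def predict(models, comment, threshold=0.5):
    """Return every label whose classifier scores at or above the threshold."""
    x = embed(comment)
    return [label for label, (w, b) in zip(LABELS, models)
            if score(w, b, x) >= threshold]

# Toy training set: each row is a 0/1 vector aligned with LABELS.
COMMENTS = ["you idiot", "stupid idiot", "i will kill you",
            "i will hurt you", "thanks great edit", "great thanks"]
LABEL_ROWS = [
    [1, 0, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
]

models = train_multilabel(COMMENTS, LABEL_ROWS)
print(predict(models, "you stupid idiot"))
```

Because each label has its own classifier, a comment can receive several labels at once or none at all, which is what distinguishes multi-label from ordinary multi-class classification.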
Pages: 105-112
Number of pages: 8