Comparison of Deep Learning Models and Various Text Pre-Processing Techniques for the Toxic Comments Classification

Cited by: 39
Authors
Maslej-Kresnakova, Viera [1]
Sarnovsky, Martin [1]
Butka, Peter [1]
Machova, Kristina [1]
Affiliations
[1] Tech Univ Kosice, Fac Elect Engn & Informat, Dept Cybernet & Artificial Intelligence, Kosice 04001, Slovakia
Source
APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 23
Keywords
natural language processing; toxic comments; classification; deep learning; neural networks; NETWORKS
DOI
10.3390/app10238631
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The emergence of anti-social behaviour in online environments presents a serious issue in today's society. Automatic detection and identification of such behaviour are becoming increasingly important. Modern machine learning and natural language processing methods can provide effective tools to detect different types of anti-social behaviour in text. In this work, we present a comparison of various deep learning models used to identify toxic comments in Internet discussions. Our main goal was to explore the effect of data preparation on model performance. Working from the assumption that traditional pre-processing methods may lead to the loss of characteristic traits specific to toxic content, we compared several popular deep learning and transformer language models. We aimed to analyze the influence of different pre-processing techniques and text representations, including standard TF-IDF and pre-trained word embeddings, and also explored currently popular transformer models. Experiments were performed on the dataset from the Kaggle Toxic Comment Classification competition, and the best-performing model was compared with similar approaches using standard metrics used in data analysis.
Pages: 1 / 26
Page count: 26