Improving classifier training efficiency for automatic cyberbullying detection with Feature Density

Cited by: 24
Authors
Eronen, Juuso [1]
Ptaszynski, Michal [1]
Masui, Fumito [1]
Smywinski-Pohl, Aleksander [2]
Leliwa, Gniewosz [3]
Wroczynski, Michal [3]
Affiliations
[1] Kitami Inst Technol, Kitami, Hokkaido, Japan
[2] AGH Univ Sci & Technol, Krakow, Poland
[3] Samurailabs, Gdynia, Poland
Keywords
Feature density; Dataset complexity; Linguistics; Cyberbullying; Document classification; Preprocessing; SYNTACTIC COMPLEXITY; IMPACT; TIMES; SIZE;
DOI
10.1016/j.ipm.2021.102616
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesize that estimating dataset complexity allows for the reduction of the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to the increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The problem of constantly increasing needs for more powerful computational resources is also affecting the environment due to the alarmingly growing amount of CO2 emissions caused by the training of large-scale ML models. The research was conducted on multiple datasets, including popular datasets, such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets trying to tackle the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
Pages: 37