A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter

Cited by: 0

Authors:

Usman Naseem

Imran Razzak

Peter W. Eklund

Affiliations:

[1] University of Sydney

[2] Deakin University

Source:

Multimedia Tools and Applications | 2021, Vol. 80

Keywords:

Natural language processing; Text pre-processing; Tweet classification; Machine learning

DOI:

Not available

Abstract:

Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature than feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies the different techniques so as to retain features without information loss. Two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques have a negative impact on performance; these are identified, along with the best-performing combination of pre-processing techniques.

Pages: 35239-35266

Page count: 27
