A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on Twitter

Cited by: 0
Authors
Usman Naseem
Imran Razzak
Peter W. Eklund
Affiliations
[1] University of Sydney
[2] Deakin University
Source
Multimedia Tools and Applications | 2021 / Vol. 80
Keywords
Natural language processing; Text pre-processing; Tweet classification; Machine learning
DOI
Not available
Abstract
Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it has received less attention in the literature than feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies different techniques so as to retain features without information loss. Two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques degrade performance; these are identified, along with the best-performing combination of pre-processing techniques.
Pages: 35239-35266
Page count: 27