Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets

Cited by: 69
Authors
Pota, Marco [1]
Ventura, Mirko [1]
Fujita, Hamido [2, 3, 4]
Esposito, Massimo [1]
Affiliations
[1] CNR, Inst High Performance Comp & Networking ICAR, Naples, Italy
[2] Ho Chi Minh City Univ Technol HUTECH, Fac Informat Technol, Ho Chi Minh City, Vietnam
[3] Natl Taipei Univ Technol, Taipei, Taiwan
[4] I Somet Inc Assoc, Morioka, Iwate, Japan
Funding
Japan Society for the Promotion of Science
Keywords
Sentiment analysis; Pre-processing; Twitter; English; Italian
DOI
10.1016/j.eswa.2021.115119
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Social media offer a large amount of information that can be exploited in many fields of research. However, while Natural Language Processing methods achieve good results on well-formed datasets of written text with clear syntax, social media text is written in informal language, with unstructured syntax and peculiar symbols; particular approaches are therefore required to process it. This paper addresses the task of sentiment analysis of tweets. In particular, it analyzes the pre-processing of tweets, with the aim of avoiding the noise introduced by web constructs such as URLs and mentions and by other text fragments, and of exploiting the information hidden in symbols such as emoticons, emojis and hashtags. More specifically, a number of experiments with a state-of-the-art classification model (BERT) are designed to evaluate many currently available operations for pre-processing tweets in terms of the statistical significance of their influence on sentiment analysis performance. Moreover, available data in two languages, English and Italian, are considered in order to also evaluate the dependence on language. The results make it possible to identify the most convenient strategy for pre-processing tweets, and thus to improve the state of the art in both languages for the considered sentiment analysis task.
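As a rough illustration of the kind of operations the abstract refers to, the following minimal sketch replaces URLs and mentions with placeholder tokens, splits camel-case hashtags into words, and maps a few emoticons to words. It is not the authors' pipeline: the function names, the `<url>` and `<user>` placeholders, and the tiny emoticon table are all assumptions chosen for illustration, and emoji handling is omitted.

```python
import re

# All patterns and names below are illustrative assumptions, not taken from the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Zero-width position between a lowercase and an uppercase letter.
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")

# Hypothetical, deliberately tiny emoticon-to-word table.
EMOTICONS = {":)": "smile", ":-)": "smile", ":(": "sad", ":-(": "sad"}


def split_hashtag(match):
    """Drop the '#' and split a camel-case hashtag body into words."""
    return CAMEL_RE.sub(" ", match.group(1))


def preprocess_tweet(text):
    """Apply a few common tweet pre-processing operations in sequence."""
    text = URL_RE.sub("<url>", text)          # replace URLs with a placeholder
    text = MENTION_RE.sub("<user>", text)     # replace mentions with a placeholder
    text = HASHTAG_RE.sub(split_hashtag, text)  # segment hashtags into words
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")  # translate emoticons to words
    return " ".join(text.split())             # collapse repeated whitespace


if __name__ == "__main__":
    print(preprocess_tweet("@bob I love #MachineLearning :) https://t.co/abc"))
```

Each step can be switched on or off independently, which is the kind of setup one would need in order to compare the effect of individual operations on a downstream classifier.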
Pages: 10