An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian

Cited by: 79
Authors
Pota, Marco [1]
Ventura, Mirko [1]
Catelli, Rosario [1, 2]
Esposito, Massimo [1]
Affiliations
[1] CNR, Inst High Performance Comp & Networking ICAR, I-80131 Naples, Italy
[2] Univ Naples Federico II, Dept Elect Engn & Informat Technol DIETI, I-80125 Naples, Italy
Keywords
sentiment analysis; NLP; language models; BERT; Italian language; QUESTION CLASSIFICATION
DOI
10.3390/s21010133
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle Twitter jargon. This work introduces a different, two-step approach to Twitter sentiment analysis. First, the tweet jargon, including emojis and emoticons, is transformed into plain text using procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified using the language model BERT, pre-trained on plain text rather than on tweets, for two reasons: (1) models pre-trained on plain text are readily available in many languages, which avoids resource- and time-consuming training from scratch directly on tweets; (2) available plain-text corpora are larger than tweet-only ones, which allows better performance. A case study applying the approach to Italian is presented and compared with existing Italian solutions. The results show the effectiveness of the approach and indicate that, thanks to its general methodological basis, it may also be promising for other languages.
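The first step described in the abstract (turning tweet jargon into plain text) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the abstract does not specify the paper's actual procedures, so the lookup tables, the `normalize_tweet` function, and the handling of URLs, mentions, and hashtags are hypothetical. The second step (classification with a BERT model pre-trained on plain text) is not shown; the normalized string would simply be the input to such a classifier.

```python
import re

# Hypothetical lookup tables, for illustration only.
# They are NOT the resources used in the paper.
EMOJI_TO_TEXT = {
    "\U0001F600": "grinning face",
    "\U0001F622": "crying face",
}
EMOTICON_TO_TEXT = {
    ":)": "happy",
    ":(": "sad",
}

# Assumed treatment of other tweet jargon (not stated in the abstract).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalize_tweet(text: str) -> str:
    """Rewrite tweet-specific tokens as plain words."""
    # Replace URLs first so that ':' inside 'https://' cannot be
    # mistaken for part of an emoticon later on.
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("user", text)
    # Keep the hashtag word, drop the '#' marker.
    text = HASHTAG_RE.sub(r"\1", text)
    # Substitute each emoji or emoticon with its textual description,
    # padded with spaces so it does not fuse with neighbouring words.
    for symbol, description in {**EMOJI_TO_TEXT, **EMOTICON_TO_TEXT}.items():
        text = text.replace(symbol, f" {description} ")
    # Collapse the whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_tweet("Che bella giornata :) #sole @mario https://t.co/abc"))
```

Because the substitutions are plain table lookups and regular expressions, adapting the sketch to another language would mostly mean translating the description strings, which is in the spirit of the language-independent preprocessing the abstract describes.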
Pages: 1-21
Number of pages: 21