A Mixed Malay-English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Cited by: 2
Authors
Kong, Jeffery T. H. [1]
Juwono, Filbert H. H. [2]
Ngu, Ik Ying [3]
Nugraha, I. Gde Dharma [4]
Maraden, Yan [4]
Wong, W. K. [2]
Affiliations
[1] Curtin Univ Malaysia, Dept Elect & Comp Engn, Miri 98009, Malaysia
[2] Univ Southampton Malaysia, Comp Sci Program, Iskandar Puteri 79100, Malaysia
[3] Curtin Univ Malaysia, Dept Media & Commun, Miri 98009, Malaysia
[4] Univ Indonesia, Dept Elect Engn, Depok 16424, Indonesia
Keywords
BPE; CNN; COVID-19; fake news; M-BERT; Malaysia; sentiment analysis
DOI
10.3390/bdcc7020061
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Social media has evolved into a platform for disseminating information, including fake news. False information about the Coronavirus Disease 2019 (COVID-19) pandemic, for example regarding vaccination, is widespread. In this paper, we focus on sentiment analysis of Malaysian COVID-19-related news on social media such as Twitter. Tweets in Malaysia often combine Malay, English, and Chinese, with plenty of short forms, symbols, emojis, and emoticons within the maximum length of a tweet. The contributions of this paper are twofold. First, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over 67% in Malay, 27% in English, 2% in Chinese, and 4% in other languages. Second, we manually annotated 11,568 tweets with one of three sentiment classes (positive, negative, or neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied a data compression method, Byte-Pair Encoding (BPE), to the texts and used two deep learning approaches: Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and a convolutional neural network (CNN). BPE tokenization encodes rare and unknown words as smaller, meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal BPE vocabulary size is 12,000; larger vocabularies yield little further improvement in F1-score. Overall, BPE-M-BERT slightly outperforms the CNN model, indicating that the pre-trained M-BERT network has an advantage on our multilingual dataset.
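The BPE step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `encode`) and the tiny word list are hypothetical, and a real vocabulary of the size reported in the abstract would need far more merges on a full corpus. The sketch only shows the general idea of repeatedly merging the most frequent adjacent symbol pair so that rare or unseen words break into known subwords.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus mixing Malay and English spellings.
    corpus = ["vaksin", "vaksin", "vaksinasi", "vaccine"]
    merges = learn_bpe(corpus, num_merges=10)
    print(encode("vaksinnya", merges))  # an unseen word is split into learned subwords
```

The number of merges plays the role of the vocabulary-size setting: more merges produce longer subwords and a larger vocabulary, fewer merges fall back toward individual characters.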
Pages: 15