A Mixed Malay-English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Cited by: 2
Authors
Kong, Jeffery T. H. [1]
Juwono, Filbert H. H. [2]
Ngu, Ik Ying [3]
Nugraha, I. Gde Dharma [4]
Maraden, Yan [4]
Wong, W. K. [2]
Affiliations
[1] Curtin Univ Malaysia, Dept Elect & Comp Engn, Miri 98009, Malaysia
[2] Univ Southampton Malaysia, Comp Sci Program, Iskandar Puteri 79100, Malaysia
[3] Curtin Univ Malaysia, Dept Media & Commun, Miri 98009, Malaysia
[4] Univ Indonesia, Dept Elect Engn, Depok 16424, Indonesia
Keywords
BPE; CNN; COVID-19; fake news; M-BERT; Malaysia; sentiment analysis
DOI
10.3390/bdcc7020061
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Social media has evolved into a platform for disseminating information, including fake news. Much false information circulates about the Coronavirus Disease 2019 (COVID-19) pandemic, for example regarding vaccination. In this paper, we focus on sentiment analysis of Malaysian COVID-19-related news on social media, specifically Twitter. Tweets in Malaysia are often a combination of Malay, English, and Chinese, with plenty of short forms, symbols, emojis, and emoticons packed within the maximum length of a tweet. The contributions of this paper are twofold. First, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over 67% in Malay, 27% in English, 2% in Chinese, and 4% in other languages. Second, we manually annotated 11,568 tweets with three-class sentiment labels (positive, negative, and neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied a data compression method, Byte-Pair Encoding (BPE), to the texts and used two deep learning approaches: Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and a convolutional neural network (CNN). BPE tokenization encodes rare and unknown words as smaller, meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal BPE vocabulary size is 12,000; larger vocabularies yield little further improvement in F1-score. Overall, BPE-M-BERT slightly outperforms the CNN model, which suggests that the pre-trained M-BERT network has an advantage on our multilingual dataset.
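To illustrate how BPE breaks rare words into subwords, the following is a minimal Python sketch of the generic BPE merge-learning procedure. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the toy word list are all assumptions made for illustration, and the paper's actual tokenizer, corpus, and 12,000-entry vocabulary are not reproduced here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy mixed Malay-English word list, invented for this example.
    corpus = ["vaksin", "vaksin", "vaksinasi", "vaccine", "covid", "covid19"]
    merges = learn_bpe(corpus, num_merges=10)
    print(encode("vaksinasi", merges))
```

The number of merges plays the role of the vocabulary-size parameter: more merges produce longer, more word-like subwords, while fewer merges leave text split closer to individual characters.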
Pages: 15