COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter

Cited by: 58
Authors
Müller, Martin [1]
Salathe, Marcel [1]
Kummervold, Per E. [2]
Affiliations
[1] EPFL, Digital Epidemiol Lab, Geneva, Switzerland
[2] FISABIO Publ Hlth, Vaccine Res Dept, Valencia, Spain
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE | 2023 / Vol. 6
Keywords
natural language processing (NLP); COVID-19; language model (LM); BERT; text classification
DOI
10.3389/frai.2023.1023281
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Introduction: This study presents COVID-Twitter-BERT (CT-BERT), a transformer-based model that is pre-trained on a large corpus of COVID-19 related Twitter messages. CT-BERT is specifically designed to be used on COVID-19 content, particularly from social media, and can be utilized for various natural language processing tasks such as classification, question-answering, and chatbots. This paper aims to evaluate the performance of CT-BERT on different classification datasets and compare it with BERT-LARGE, its base model.

Methods: The study utilizes CT-BERT, which is pre-trained on a large corpus of COVID-19 related Twitter messages. The authors evaluated the performance of CT-BERT on five different classification datasets, including one in the target domain. The model's performance is compared to its base model, BERT-LARGE, to measure the marginal improvement. The authors also provide detailed information on the training process and the technical specifications of the model.

Results: The results indicate that CT-BERT outperforms BERT-LARGE with a marginal improvement of 10-30% on all five classification datasets. The largest improvements are observed in the target domain. The authors provide detailed performance metrics and discuss the significance of these results.

Discussion: The study demonstrates the potential of pre-trained transformer models, such as CT-BERT, for COVID-19 related natural language processing tasks. The results indicate that CT-BERT can improve the classification performance on COVID-19 related content, especially on social media. These findings have important implications for various applications, such as monitoring public sentiment and developing chatbots to provide COVID-19 related information. The study also highlights the importance of using domain-specific pre-trained models for specific natural language processing tasks. Overall, this work provides a valuable contribution to the development of COVID-19 related NLP models.
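The abstract reports a "marginal improvement of 10-30%" without defining the term in this record. One plausible reading, stated here as an assumption and not confirmed by the text above, is the share of the baseline's remaining headroom (1 minus the baseline score) that the new model closes. The sketch below illustrates that reading; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def marginal_improvement(baseline_score: float, new_score: float) -> float:
    """Fraction of the baseline's remaining headroom closed by the new model.

    Assumed definition: (new - baseline) / (1 - baseline), with scores in [0, 1].
    """
    if not 0.0 <= baseline_score < 1.0:
        raise ValueError("baseline_score must be in [0, 1)")
    return (new_score - baseline_score) / (1.0 - baseline_score)


# Hypothetical scores for illustration only; not results from the paper.
baseline_f1 = 0.80  # e.g. a general-purpose base model
domain_f1 = 0.84    # e.g. a domain-specific model on the same dataset

print(f"{marginal_improvement(baseline_f1, domain_f1):.0%}")  # prints "20%"
```

Under this reading, a four-point absolute gain on a strong baseline counts as a 20% marginal improvement, which is why the measure can look large even when absolute differences are small.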
Pages: 6
Related papers
14 in total
  • [1] [Anonymous], 2013, P 2013 C EMPIRICAL M
  • [2] Beltagy I, 2019, 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019), P3615
  • [3] Devlin J, 2019, Arxiv, DOI arXiv:1810.04805
  • [4] Honnibal M., 2017, IN PRESS, DOI 10.3233/978-1-60750-588-4-1080
  • [5] Categorizing Vaccine Confidence With a Transformer-Based Machine Learning Model: Analysis of Nuances of Vaccine Sentiment in Twitter Discourse
    Kummervold, Per E.
    Martin, Sam
    Dada, Sara
    Kilich, Eliz
    Denny, Chermain
    Paterson, Pauline
    Larson, Heidi J.
    [J]. JMIR MEDICAL INFORMATICS, 2021, 9 (10)
  • [6] Lan ZZ, 2020, Arxiv, DOI arXiv:1909.11942
  • [7] BioBERT: a pre-trained biomedical language representation model for biomedical text mining
    Lee, Jinhyuk
    Yoon, Wonjin
    Kim, Sungdong
    Kim, Donghyeon
    Kim, Sunkyu
    So, Chan Ho
    Kang, Jaewoo
    [J]. BIOINFORMATICS, 2020, 36 (04) : 1234 - 1240
  • [8] Liu YH, 2019, Arxiv, DOI [arXiv:1907.11692, DOI 10.48550/ARXIV.1907.11692]
  • [9] "Vaccines for pregnant women ...?! Absurd" - Mapping maternal vaccination discourse and stance on social media over six months
    Martin, Sam
    Kilich, Eliz
    Dada, Sara
    Kummervold, Per Egil
    Denny, Chermain
    Paterson, Pauline
    Larson, Heidi J.
    [J]. VACCINE, 2020, 38 (42) : 6627 - 6637
  • [10] Crowdbreaks: Tracking Health Trends Using Public Social Media Data and Crowdsourcing
    Mueller, Martin M.
    Salathe, Marcel
    [J]. FRONTIERS IN PUBLIC HEALTH, 2019, 7