COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter

Cited by: 58

Authors:

Müller, Martin [1]

Salathe, Marcel [1]

Kummervold, Per E. [2]

Affiliations:

[1] EPFL, Digital Epidemiol Lab, Geneva, Switzerland

[2] FISABIO Publ Hlth, Vaccine Res Dept, Valencia, Spain

Source:

FRONTIERS IN ARTIFICIAL INTELLIGENCE | 2023 / Volume 6

Keywords:

natural language processing (NLP); COVID-19; language model (LM); BERT; text classification

DOI:

10.3389/frai.2023.1023281

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Introduction: This study presents COVID-Twitter-BERT (CT-BERT), a transformer-based model that is pre-trained on a large corpus of COVID-19 related Twitter messages. CT-BERT is specifically designed to be used on COVID-19 content, particularly from social media, and can be utilized for various natural language processing tasks such as classification, question-answering, and chatbots. This paper aims to evaluate the performance of CT-BERT on different classification datasets and compare it with BERT-LARGE, its base model.

Methods: The study utilizes CT-BERT, which is pre-trained on a large corpus of COVID-19 related Twitter messages. The authors evaluated the performance of CT-BERT on five different classification datasets, including one in the target domain. The model's performance is compared to its base model, BERT-LARGE, to measure the marginal improvement. The authors also provide detailed information on the training process and the technical specifications of the model.

Results: The results indicate that CT-BERT outperforms BERT-LARGE with a marginal improvement of 10-30% on all five classification datasets. The largest improvements are observed in the target domain. The authors provide detailed performance metrics and discuss the significance of these results.

Discussion: The study demonstrates the potential of pre-trained transformer models, such as CT-BERT, for COVID-19 related natural language processing tasks. The results indicate that CT-BERT can improve the classification performance on COVID-19 related content, especially on social media. These findings have important implications for various applications, such as monitoring public sentiment and developing chatbots to provide COVID-19 related information. The study also highlights the importance of using domain-specific pre-trained models for specific natural language processing tasks. Overall, this work provides a valuable contribution to the development of COVID-19 related NLP models.


Pages: 6
