Development of a COVID-19-Related Anti-Asian Tweet Data Set: Quantitative Study

Cited by: 2
Authors
Mokhberi, Maryam [1]
Biswas, Ahana [2]
Masud, Zarif [1, 4, 5]
Kteily-Hawa, Roula [3]
Goldstein, Abby [4]
Gillis, Joseph Roy [4]
Rayana, Shebuti [5]
Ahmed, Syed Ishtiaque [1]
Affiliations
[1] Univ Toronto, Dept Comp Sci, 145 Cosburn Ave, Toronto, ON M4J 2L2, Canada
[2] Indian Inst Technol Kanpur, Kanpur, India
[3] Brescia Univ Coll Western, London, ON, Canada
[4] Ontario Inst Studies Educ, Toronto, ON, Canada
[5] SUNY Old Westbury, Math Comp & Informat Sci, Old Westbury, NY USA
Keywords
COVID-19; stigma; hate speech; classification; annotation; data set; Sinophobia; Twitter; BERT; pandemic; data; online; community; Asian; research; discrimination; hate
DOI
10.2196/40403
Chinese Library Classification
R19 [Health organizations and services (health services administration)]
Abstract
Background: Since the onset of the COVID-19 pandemic, individuals of Asian descent (in the colloquial usage prevalent in North America, where "Asian" refers to people from East Asia, particularly China) have been subjected to stigma and hate speech in both offline and online communities. Social networks such as Twitter are among the major venues where such attacks occur. As the research community seeks to understand, analyze, and detect this content, high-quality data sets are becoming increasingly important.
Objective: In this study, we introduce a manually labeled data set of tweets containing anti-Asian stigmatizing content.
Methods: We sampled over 668 million tweets posted on Twitter from January to July 2020 and used an iterative data construction approach that included 3 stages of algorithm-driven data selection. Volunteers then manually annotated the selected tweets, yielding a high-quality data set of tweets and a second, smaller subsampled data set with higher-quality labels from multiple annotators. The final data set captures stigma toward Chinese people on Twitter during the COVID-19 pandemic. The data set and the labeling instructions are available in the GitHub repository. Furthermore, we implemented several state-of-the-art models for detecting stigmatizing tweets to set initial benchmarks for our data set.
Results: Our primary contributions are the labeled data sets. Data Set v3.0 contained 11,263 tweets with primary labels (unknown/irrelevant, not-stigmatizing, stigmatizing-low, stigmatizing-medium, stigmatizing-high) and tweet subtopics (eg, wet market and eating habits, COVID-19 cases, bioweapon). Data Set v3.1 contained 4998 (44.4%) tweets randomly sampled from Data Set v3.0; a second annotator labeled them on the primary labels only, and a third annotator then resolved conflicts between the first and second annotators. To demonstrate the usefulness of our data set, we ran preliminary experiments in which the Bidirectional Encoder Representations from Transformers (BERT) model achieved the highest accuracy, 79%, when detecting stigma on unseen data, whereas traditional models such as a support vector machine (SVM) reached 73% accuracy.
Conclusions: Our data set can be used as a benchmark for further qualitative and quantitative research and analysis on this issue. First, it reaffirms the existence and significance of widespread discrimination and stigma toward the Asian population worldwide. Moreover, our data set and subsequent arguments should help researchers from various domains, including psychologists, public policy authorities, and sociologists, analyze the complex economic, political, historical, and cultural roots of anti-Asian stigmatization and hateful behaviors. A manually annotated data set is of paramount importance for developing algorithms that detect stigma or problematic text, particularly on social media. We believe this contribution will support prediction and, subsequently, the design of interventions that significantly reduce stigma, hate, and discrimination against marginalized populations during future crises like COVID-19.
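The three-annotator scheme described in the Results for Data Set v3.1 can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `resolve_label` and the rule that the third annotator's label is taken as final on disagreement are assumptions made for illustration; the abstract only states that a third annotator resolved conflicts. The five label strings are the primary labels listed in the abstract.

```python
from typing import Optional

# Primary labels as listed in the abstract.
PRIMARY_LABELS = (
    "unknown/irrelevant",
    "not-stigmatizing",
    "stigmatizing-low",
    "stigmatizing-medium",
    "stigmatizing-high",
)


def resolve_label(first: str, second: str, third: Optional[str] = None) -> str:
    """Return a final label for one tweet (hypothetical helper).

    Assumption: if the first two annotators agree, their label stands;
    otherwise the third annotator's label is taken as final.
    """
    for label in (first, second):
        if label not in PRIMARY_LABELS:
            raise ValueError(f"unknown label: {label!r}")
    if first == second:
        return first
    if third is None:
        raise ValueError("annotators disagree; a third label is required")
    if third not in PRIMARY_LABELS:
        raise ValueError(f"unknown label: {third!r}")
    return third


# Share of Data Set v3.0 (11,263 tweets) that was sampled into v3.1 (4998 tweets).
sample_share = round(4998 / 11263 * 100, 1)
```

For example, `resolve_label("stigmatizing-low", "stigmatizing-high", "stigmatizing-medium")` returns the third annotator's label under the stated assumption, while two matching labels need no third opinion.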
Pages: 14