Characterizing the Prevalence of Obesity Misinformation, Factual Content, Stigma, and Positivity on the Social Media Platform Reddit Between 2011 and 2019: Infodemiology Study

Cited by: 6
Authors
Pollack, Catherine C. [1, 2]
Emond, Jennifer A. [1, 3]
O'Malley, A. James [1, 4]
Byrd, Anna [2]
Green, Peter [2]
Miller, Katherine E. [2]
Vosoughi, Soroush [5]
Gilbert-Diamond, Diane [2, 3, 6]
Onega, Tracy [7]
Affiliations
[1] Geisel Sch Med Dartmouth, Dept Biomed Data Sci, Rubin 8331,Med Ctr Dr, Lebanon, NH 03756 USA
[2] Geisel Sch Med Dartmouth, Dept Epidemiol, Lebanon, NH USA
[3] Geisel Sch Med Dartmouth, Dept Pediat, Lebanon, NH USA
[4] Dartmouth Inst Hlth Policy & Clin Practice, Hanover, NH USA
[5] Dartmouth Coll, Dept Comp Sci, Hanover, NH USA
[6] Geisel Sch Med Dartmouth, Dept Med, Lebanon, NH USA
[7] Univ Utah, Huntsman Canc Inst, Dept Populat Hlth Sci, Salt Lake City, UT USA
Keywords
obesity; misinformation; social stigma; social media; Reddit; natural language processing; ZERO-MODIFIED COUNT; WEIGHT STIGMA; SEMICONTINUOUS DATA; HIGH AGREEMENT; LOW KAPPA; HEALTH; IMPACT; OVERWEIGHT; PEOPLE;
DOI
10.2196/36729
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Abstract
Background: Reddit is a popular social media platform that has faced scrutiny for inflammatory language against those with obesity, yet there has been no comprehensive analysis of its obesity-related content.
Objective: We aimed to quantify the presence of 4 types of obesity-related content on Reddit (misinformation, facts, stigma, and positivity) and identify psycholinguistic features that may be enriched within each one.
Methods: All sentences (N=764,179) containing "obese" or "obesity" from top-level comments (n=689,447) made on non-age-restricted subreddits (ie, smaller communities within Reddit) between 2011 and 2019 that contained one of a series of keywords were evaluated. Four types of common natural language processing features were extracted: bigram term frequency-inverse document frequency, word embeddings derived from Bidirectional Encoder Representations from Transformers, sentiment from the Valence Aware Dictionary and sEntiment Reasoner, and psycholinguistic features from the Linguistic Inquiry and Word Count program. These features were used to train an Extreme Gradient Boosting machine learning classifier to label each sentence as 1 of the 4 content categories or other. Two-part hurdle models for semicontinuous data (which use logistic regression to assess the odds of a 0 result and linear regression for continuous data) were used to evaluate whether select psycholinguistic features presented differently in misinformation (compared with facts) or stigma (compared with positivity).
Results: After removing ambiguous sentences, 0.47% (3610/764,179) of the sentences were labeled as misinformation, 1.88% (14,366/764,179) were labeled as stigma, 1.94% (14,799/764,179) were labeled as positivity, and 8.93% (68,276/764,179) were labeled as facts. Each category had markers that distinguished it from other categories within the data as well as from an external corpus. For example, misinformation had a higher average percentage of negations (beta=3.71, 95% CI 3.53-3.90; P<.001) but a lower average number of words >6 letters (beta=-1.47, 95% CI -1.85 to -1.10; P<.001) relative to facts. Stigma had a higher proportion of swear words (beta=1.83, 95% CI 1.62-2.04; P<.001) but a lower proportion of first-person singular pronouns (beta=-5.30, 95% CI -5.44 to -5.16; P<.001) relative to positivity.
Conclusions: Types of obesity-related content on Reddit have distinct psycholinguistic properties that can be leveraged to rapidly identify deleterious content with minimal human intervention and provide insights into how the Reddit population perceives patients with obesity. Future work should assess whether these properties are shared across languages and other social media platforms.
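The two-part hurdle model named in the Methods can be illustrated with a minimal sketch. This is not the authors' code: it assumes a single binary predictor (misinformation vs facts), in which case the logistic-regression coefficient reduces to a log odds ratio of a 0 result and the linear-regression coefficient reduces to a difference in means among the nonzero values. The function name `hurdle_two_part` and the toy data are hypothetical.

```python
import math


def hurdle_two_part(values, in_group):
    """Two-part hurdle sketch for one semicontinuous feature and one binary predictor.

    Part 1 returns the log odds ratio of observing a 0 (group vs reference),
    which equals the logistic-regression slope for a single binary predictor.
    Part 2 returns the difference in means of the nonzero values, which equals
    the linear-regression slope for a single binary predictor.
    """
    group, reference = [], []
    for value, flag in zip(values, in_group):
        (group if flag else reference).append(value)

    def odds_of_zero(xs):
        zeros = sum(1 for x in xs if x == 0)
        nonzeros = len(xs) - zeros
        if zeros == 0 or nonzeros == 0:
            raise ValueError("each group needs both zero and nonzero values")
        return zeros / nonzeros

    def mean_nonzero(xs):
        nonzero = [x for x in xs if x != 0]
        return sum(nonzero) / len(nonzero)

    log_or_zero = math.log(odds_of_zero(group) / odds_of_zero(reference))
    beta = mean_nonzero(group) - mean_nonzero(reference)
    return log_or_zero, beta


# Hypothetical toy data: percentage of negation words per sentence.
values = [0.0, 5.0, 7.0, 6.0, 0.0, 0.0, 2.0, 4.0]
is_misinfo = [True, True, True, True, False, False, False, False]

log_or_zero, beta = hurdle_two_part(values, is_misinfo)
```

On this toy data, a negative `log_or_zero` would mean the misinformation group is less likely than the facts group to contain no negations at all, and a positive `beta` would mean its nonzero negation percentages are higher on average. A full analysis would add covariates, confidence intervals, and multiple-testing correction, none of which are shown here.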
Pages: 15