The Efficacy of Large Language Models and Crowd Annotation for Accurate Content Analysis of Political Social Media Messages

Cited by: 0
Authors
Stromer-Galley, Jennifer [1,2]
McKernan, Brian [3]
Zaman, Saklain [2]
Maganur, Chinmay [1]
Regmi, Sampada [1]
Affiliations
[1] Syracuse Univ, Syracuse, NY, USA
[2] Syracuse Univ, Sch Informat Studies, 343 Hinds Hall, Syracuse, NY 13244, USA
[3] Pace Univ, Dept Commun & Media Studies, New York, NY, USA
Keywords
large language models; artificial intelligence; crowdsourcing; content analysis; social media; machine learning; candidates; Twitter
DOI
10.1177/08944393251334977
CLC Classification
TP39 [computer applications]
Discipline Codes
081203; 0835
Abstract
Systematic content analysis of messaging has been a staple method in the study of communication. While computer-assisted content analysis has been used in the field for three decades, advances in machine learning and crowd-based annotation, combined with the ease of collecting large volumes of text-based communication via social media, have made classifying messages easier and faster. The greatest advance yet may be general-purpose large language models (LLMs), which are ostensibly able to classify messages accurately and reliably by leveraging context to disambiguate meaning. It is unclear, however, how effectively LLMs perform content analysis. In this study, we compare classifications of political candidates' social media messages produced by trained annotators, crowd annotators, and OpenAI large language models accessed through the free web interface (ChatGPT) and the paid API (GPT API), across five categories of political communication commonly used in the literature. We find that crowd annotation generally achieved higher F1 scores than ChatGPT and an earlier version of the GPT API, although the newest version, GPT-4 accessed via the API, performed well against both the crowd and ground-truth data derived from trained student annotators. This study suggests that applying any LLM to an annotation task requires validation, and that freely available and older LLMs may not be effective for studying human communication.
Pages: 22
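
For readers who want to see what the comparison described in the abstract involves in practice, the following is a minimal Python sketch, not the authors' code: it labels a few messages with the OpenAI chat API and scores the output against trained-annotator labels using the F1 metric the abstract reports. The binary "attack"/"not attack" label set, the prompt wording, the example messages, and the gold labels are all illustrative assumptions standing in for the paper's five political-communication categories and its ground-truth data.

# Minimal sketch, not the authors' pipeline. Requires:
#   pip install openai scikit-learn
# and OPENAI_API_KEY set in the environment. The label set, prompt,
# messages, and gold labels below are hypothetical stand-ins for the
# paper's five categories and its trained-annotator ground truth.
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = {"attack", "not attack"}  # hypothetical binary category

def classify(message: str) -> str:
    """Ask the model for exactly one label for one candidate message."""
    response = client.chat.completions.create(
        model="gpt-4",   # the strongest model compared in the study
        temperature=0,   # reduce run-to-run label variation
        messages=[
            {"role": "system",
             "content": ("You are a content-analysis annotator. Reply with "
                         "exactly one label: 'attack' or 'not attack'.")},
            {"role": "user", "content": message},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in LABELS else "not attack"  # crude fallback

# Hypothetical messages with gold labels from trained annotators.
messages = ["My opponent voted to raise your taxes three times.",
            "Join us for the town hall on Saturday!"]
gold = ["attack", "not attack"]

predicted = [classify(m) for m in messages]
print("F1 vs. ground truth:", f1_score(gold, predicted, pos_label="attack"))

In a real replication one would batch hundreds of messages per category, compute F1 per category as the paper does, and run the same scoring over crowd-annotator labels to complete the three-way comparison.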