Empath: Understanding Topic Signals in Large-Scale Text

Cited by: 206
Authors
Fast, Ethan [1]
Chen, Binbin [1]
Bernstein, Michael S. [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Source
34th Annual CHI Conference on Human Factors in Computing Systems, CHI 2016 | 2016
Keywords
social computing; computational social science; fiction
DOI
10.1145/2858036.2858535
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
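The seed-expansion step described in the abstract can be illustrated with a minimal sketch. It is not Empath's implementation: the three-dimensional vectors, the function names `expand_category` and `crowd_filter`, and the set of approved terms are all hypothetical stand-ins for the fiction-trained embedding and the crowd-powered filter.

```python
import math

# Toy 3-dimensional "embedding", invented for illustration only.
# The embedding described in the abstract is learned from fiction text.
TOY_EMBEDDING = {
    "bleed": (0.9, 0.1, 0.0),
    "punch": (0.8, 0.2, 0.1),
    "stab": (0.85, 0.15, 0.05),
    "wound": (0.8, 0.1, 0.1),
    "vote": (0.0, 0.9, 0.1),
    "senate": (0.1, 0.95, 0.0),
    "tweet": (0.05, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, embedding, top_n=2):
    """Return the top_n non-seed words closest to the centroid of the seeds."""
    dims = len(next(iter(embedding.values())))
    centroid = tuple(
        sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dims)
    )
    scored = [
        (cosine(centroid, vec), word)
        for word, vec in embedding.items()
        if word not in seeds
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]


def crowd_filter(terms, approved):
    """Hypothetical stand-in for human validation: keep approved terms only."""
    return [t for t in terms if t in approved]


candidates = expand_category(["bleed", "punch"], TOY_EMBEDDING)
violence = crowd_filter(candidates, approved={"stab", "wound"})
print(violence)
```

In this toy setup, the seeds "bleed" and "punch" pull in their nearest neighbors by cosine similarity, and the filter then keeps only the terms a human reviewer has accepted, mirroring the generate-then-validate order the abstract describes.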
Pages: 4647-4657
Page count: 11