Closed- and Open-Vocabulary Approaches to Text Analysis: A Review, Quantitative Comparison, and Recommendations

Cited by: 87
Authors
Eichstaedt, Johannes C. [1, 2]
Kern, Margaret L. [3 ]
Yaden, David B. [4 ]
Schwartz, H. A. [5 ]
Giorgi, Salvatore [6 ]
Park, Gregory [6 ]
Hagan, Courtney A. [6 ]
Tobolsky, Victoria A. [6 ]
Smith, Laura K. [6 ]
Buffone, Anneke [6 ]
Iwry, Jonathan [6 ]
Seligman, Martin E. P. [6 ]
Ungar, Lyle H. [6 ]
Affiliations
[1] Stanford Univ, Dept Psychol, 450 Jane Stanford Way,Bldg 420, Stanford, CA 94305 USA
[2] Stanford Univ, Inst Human Ctr AI, Stanford, CA 94305 USA
[3] Univ Melbourne, Melbourne Grad Sch Educ, Melbourne, Vic, Australia
[4] Johns Hopkins Med, Dept Psychiat & Behav Sci, Baltimore, MD USA
[5] SUNY Stony Brook, Dept Comp Sci, New York, NY USA
[6] Univ Penn, Dept Psychol, Philadelphia, PA 19104 USA
Keywords
text analysis; computational social science; method comparison; language; natural language processing; LATENT SEMANTIC ANALYSIS; SOCIAL MEDIA; LANGUAGE USE; NATURAL-LANGUAGE; SECRET LIFE; WORDS; REPRESENTATIONS; DICTIONARIES; TRAITS; MODELS;
DOI
10.1037/met0000349
Chinese Library Classification number
B84 [Psychology]
Subject classification code
04; 0402
Abstract
Technology now makes it possible to understand efficiently and at large scale how people use language to reveal their everyday thoughts, behaviors, and emotions. Written text has been analyzed through both theory-based, closed-vocabulary methods from the social sciences and data-driven, open-vocabulary methods from computer science, but these approaches have not been comprehensively compared. To provide guidance on best practices for automatically analyzing written text, this narrative review and quantitative synthesis compares five predominant closed- and open-vocabulary methods: Linguistic Inquiry and Word Count (LIWC), the General Inquirer, DICTION, Latent Dirichlet Allocation, and Differential Language Analysis. We compare the linguistic features associated with gender, age, and personality across the five methods using an existing dataset of Facebook status updates and self-reported survey data from 65,896 users. Results are fairly consistent across methods. The closed-vocabulary approaches efficiently summarize concepts and are helpful for understanding how people think, with LIWC2015 yielding the strongest, most parsimonious results. Open-vocabulary approaches reveal more specific and concrete patterns across a broad range of content domains, better address ambiguous word senses, and are less prone to misinterpretation, suggesting that they are well-suited for capturing the nuances of everyday psychological processes. We detail several errors that can occur in closed-vocabulary analyses; the impact of sample size, number of words per user, and number of topics included in open-vocabulary analyses; and the implications of different analytical decisions. We conclude with recommendations for researchers, advocating for a complementary approach that combines closed- and open-vocabulary methods.
Translational Abstract
A considerable amount of text data exists online that captures people's everyday thoughts, emotions, and behaviors. Technological advances now make it possible to analyze such data efficiently and at large scale, providing insights into everyday psychological processes as they occur in the real world. To provide guidance on best-practice approaches for using such data effectively, this synthesis reviews and quantitatively compares the main closed-vocabulary approaches (theoretically derived lists of words from the social sciences) and open-vocabulary approaches (data-driven techniques from computer science that explore many words, phrases, and topics) for automated text analysis. We find that the different methods are complementary: closed-vocabulary approaches provide a way to study the fundamental patterns of how people think and feel, whereas open-vocabulary approaches best elucidate what people think and feel.
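The distinction the abstract draws between closed- and open-vocabulary feature extraction can be illustrated with a minimal sketch. This is not the implementation of LIWC, Differential Language Analysis, or any other method named above. The word list POSITIVE_EMOTION and the function names are hypothetical, chosen only for illustration, and the open-vocabulary function covers only the counting step, not the later step of relating word use to outcomes such as age or personality.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary for illustration; not an actual LIWC category.
POSITIVE_EMOTION = {"happy", "love", "great"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def closed_vocab_score(text, category):
    """Closed vocabulary: share of tokens found in a predefined word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in category for token in tokens) / len(tokens)


def open_vocab_frequencies(texts):
    """Open vocabulary: relative frequency of every word seen in the data."""
    counts = Counter(token for text in texts for token in tokenize(text))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

In the closed-vocabulary case the researcher fixes the category in advance and obtains one score per text; in the open-vocabulary case the feature set is whatever words the data happen to contain.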
Pages: 398-427
Page count: 30