Using Natural Language Processing and Machine Learning to Replace Human Content Coders

Cited by: 17
Authors
Wang, Yilei [1]
Tian, Jingyuan [1]
Yazar, Yagizhan [1]
Ones, Deniz S. [1]
Landers, Richard N. [1]
Affiliations
[1] Univ Minnesota Twin Cities, Dept Psychol, N218 Elliott Hall, 75 East River Rd, Minneapolis, MN 55455 USA
Keywords
natural language processing; machine learning; content analysis; text classification; environmental sustainability; SELECTION; VALUES; WORK; SIZE
DOI
10.1037/met0000518
Chinese Library Classification number
B84 [Psychology]
Discipline classification code
04; 0402
Abstract
Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques to psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm to validly predict those codes, at generally acceptable standards. We then conduct a Monte-Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations to psychologists on the necessary dataset characteristics to achieve valid prediction of content codes to guide researchers on the use of NLP models to replace human coders in content analysis research.
Translational Abstract
As psychological research enters the "Big Data" era where much richer, yet often unstructured data can be utilized to answer novel research questions, it becomes increasingly important for researchers and practitioners to possess necessary techniques to analyze these data. Content analysis of psychological data is an often-utilized technique with practical shortcomings where human coders process and code, by psychological science standards, large amounts of data. The present study demonstrated that NLP and ML can be engineered to predict human codes with sufficiently high accuracy to maintain psychometric rigor, thereby saving considerable time and human coder efforts. Furthermore, based on the results of a Monte-Carlo simulation, practical guidelines and recommendations on the necessary dataset characteristics (e.g., sample size) to achieve valid prediction of content codes were provided for practitioners who wish to adopt this technique.
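The simulation design named in the abstract (sample size, unlabeled proportion, base rate, and human coder reliability) can be illustrated with a toy Monte-Carlo sketch. This is not the authors' code or model: the single numeric feature stands in for NLP-derived text features, the midpoint-threshold classifier is a deliberately simple substitute for a real text classifier, coder reliability is approximated as the per-case probability that a human code matches the true code, and all function and parameter names (`simulate_once`, `monte_carlo`, `coder_accuracy`) are hypothetical.

```python
import random
from statistics import mean


def f1_score(truth, pred):
    """F1 for the positive class; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def simulate_once(n, unlabeled_prop, base_rate, coder_accuracy, rng):
    """One replication: train on human-coded cases, score the unlabeled ones."""
    n_labeled = int(n * (1 - unlabeled_prop))
    if not 0 < n_labeled < n:
        raise ValueError("need at least one labeled and one unlabeled case")

    # Assumption: true codes are Bernoulli(base_rate); the feature is a noisy
    # signal of the true code (a stand-in for real text features).
    truth = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    feature = [2.0 * t + rng.gauss(0.0, 1.0) for t in truth]

    # Assumption: a human code equals the true code with prob. coder_accuracy.
    human = [t if rng.random() < coder_accuracy else 1 - t for t in truth]

    # "Train": midpoint between the class means of the labeled cases,
    # with the direction of the rule also learned from the human codes.
    pos = [x for x, h in zip(feature[:n_labeled], human[:n_labeled]) if h == 1]
    neg = [x for x, h in zip(feature[:n_labeled], human[:n_labeled]) if h == 0]
    if pos and neg:
        threshold = (mean(pos) + mean(neg)) / 2
        sign = 1 if mean(pos) >= mean(neg) else -1
    else:  # only one class observed among the labeled cases
        threshold, sign = mean(feature[:n_labeled]), 1

    # Predict the unlabeled cases and score against the simulated true codes.
    pred = [1 if sign * (x - threshold) > 0 else 0 for x in feature[n_labeled:]]
    return f1_score(truth[n_labeled:], pred)


def monte_carlo(n, unlabeled_prop, base_rate, coder_accuracy, reps=200, seed=0):
    """Average F1 over `reps` replications of one design cell."""
    rng = random.Random(seed)
    scores = [simulate_once(n, unlabeled_prop, base_rate, coder_accuracy, rng)
              for _ in range(reps)]
    return mean(scores)


if __name__ == "__main__":
    # Vary one factor (coder accuracy) while holding the other three fixed.
    for acc in (0.5, 0.7, 0.95):
        print(acc, round(monte_carlo(200, 0.5, 0.5, acc), 3))
```

Looping `monte_carlo` over a grid of the four arguments gives a crude analogue of a full factorial simulation; a realistic study would replace the toy feature and threshold rule with actual text features and a trained classifier.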
Pages: 1148-1163
Number of pages: 17