Deep Learning in Employee Selection: Evaluation of Algorithms to Automate the Scoring of Open-Ended Assessments

Cited: 15
Authors
Thompson, Isaac [1 ]
Koenig, Nick [1 ]
Mracek, Derek L. [1 ]
Tonidandel, Scott [2 ]
Affiliations
[1] Modern Hire, Cleveland, OH, USA
[2] University of North Carolina at Charlotte, Charlotte, NC 28223, USA
Keywords
Deep learning; Machine learning; Natural language processing; Measurement; Guidelines; Knowledge; Validity
DOI
10.1007/s10869-023-09874-y
Chinese Library Classification
F [Economics]
Subject Classification Code
02
Abstract
This paper explores the application of deep learning to automating the scoring of open-ended candidate responses on pre-hire employment selection assessments. Using job applicant text data from pre-employment virtual assessment center exercises, three algorithmic approaches were compared: a traditional bag-of-words (BoW) approach, long short-term memory (LSTM) models, and the robustly optimized bidirectional encoder representations from transformers (RoBERTa) approach. Measurement and assessment best practices were leveraged in developing the candidate assessment items and the human labels (subject matter experts' (SME) ratings on job-relevant competencies), producing a rich data set on which to train the algorithms. The trained models were used to score the candidate textual responses on the given competencies, and the level of agreement with expert human raters was assessed. Using data from three companies hiring for three different occupations, and across seven competencies, the three approaches were evaluated; correlations between SME-scored and algorithmically scored competencies on holdout samples were very strong for the best-performing method, RoBERTa (avg r = 0.84), and nearly identical to the inter-rater reliability achieved by multiple expert raters following consensus (avg r = 0.85). Criterion-related validity, subgroup differences, and decision accuracy are investigated for each algorithmic approach. Lastly, the impact of smaller training sample sizes on algorithm performance is explored.
Pages: 509-527
Page count: 19
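
The abstract outlines a standard evaluation design: train a text-scoring model on SME-labeled responses, then measure its agreement with SME ratings on a holdout sample. As a minimal, hypothetical sketch of that pipeline, the snippet below implements only the simplest of the three approaches (a BoW model via TF-IDF plus ridge regression); the responses, ratings, split, and hyperparameters are invented for illustration and are not the authors' data or code.

```python
# Minimal sketch of the scoring-evaluation design described in the abstract,
# using only the bag-of-words (BoW) baseline. All responses, ratings, and
# hyperparameters below are hypothetical stand-ins, not the paper's materials.
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical candidate responses with consensus SME competency ratings.
texts = [
    "I met with the client to clarify requirements before committing to a plan.",
    "I told my coworker the problem was not my responsibility.",
    "I broke the project into milestones and assigned owners to each task.",
    "I waited for my manager to tell me exactly what to do.",
    "I gathered feedback from the team and revised the proposal twice.",
    "I escalated immediately without checking the documentation first.",
    "I documented the process so the next hire could repeat it independently.",
    "I skipped the status meeting because I was busy.",
    "I negotiated a new deadline after explaining the trade-offs to the client.",
    "I ignored the customer's follow-up emails.",
]
sme_ratings = [4.5, 1.5, 4.0, 2.0, 4.5, 2.0, 4.0, 1.0, 5.0, 1.0]

# Holdout evaluation mirrors the paper's design: fit on one split, then
# assess agreement with SME ratings on the unseen split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, sme_ratings, test_size=0.3, random_state=0
)

# BoW representation (TF-IDF over unigrams and bigrams) + ridge regression.
bow_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    Ridge(alpha=1.0),
)
bow_model.fit(X_train, y_train)

# Agreement between model scores and SME ratings on the holdout sample,
# reported as a Pearson correlation (the paper's avg r metric).
r, _ = pearsonr(y_test, bow_model.predict(X_test))
print(f"Holdout model-SME correlation: r = {r:.2f}")
```

The paper's LSTM and RoBERTa variants would replace the TF-IDF-plus-ridge step with a learned encoder; for RoBERTa, one common substitution is a transformer regression head (e.g., Hugging Face's AutoModelForSequenceClassification loaded with num_labels=1), fine-tuned on the same SME labels.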