A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements

Cited by: 44
Authors
Zhai, Xiaoming [1 ]
Shi, Lehong [2 ]
Nehm, Ross H. [3 ]
Affiliations
[1] Univ Georgia, Dept Math & Sci Educ, Athens, GA 30602 USA
[2] Univ Georgia, Dept Career & Informat Studies, Athens, GA 30605 USA
[3] SUNY Stony Brook, Dept Ecol & Evolut, Stony Brook, NY 11794 USA
Funding
US National Science Foundation;
Keywords
Machine learning; Science assessment; Meta-analysis; Interrater reliability; Validity; Cohen's kappa; Artificial intelligence; Automated guidance; Formative assessment; Online; Essays; Explanations; Revision; Feedback; Kappa; Tool
DOI
10.1007/s10956-020-09875-z
Chinese Library Classification (CLC)
G40 [Education];
Subject classification codes
040101; 120403;
Abstract
Machine learning (ML) has been increasingly employed in science assessment to facilitate automatic scoring, although with varying degrees of success (i.e., magnitudes of machine-human score agreements [MHAs]). Little work has empirically examined the factors that drive MHA disparities in this growing field, which constrains improvement of machine scoring capacity and its wider application in science education. We performed a meta-analysis of 110 studies of MHAs to identify the factors most strongly contributing to scoring success (i.e., high Cohen's kappa [kappa]). We empirically examined six factors proposed as contributors to MHA magnitudes: algorithm, subject domain, assessment format, construct, school level, and machine supervision type. Our analyses of the 110 MHAs revealed substantial heterogeneity in kappa (weighted mean = .64; range = .09-.97). Three-level random-effects modeling indicated that MHA heterogeneity was explained by variability both within publications (i.e., at the assessment task level: 82.6%) and between publications (i.e., at the individual study level: 16.7%). Our results also suggest that all six factors have significant moderator effects on scoring success magnitudes. Among these, algorithm and subject domain had significantly larger effects than the other factors, suggesting that technical features and assessment-external features might be primary targets for improving MHAs and ML-based science assessments.
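The agreement metric at the center of the meta-analysis is Cohen's kappa, which corrects the observed machine-human agreement for agreement expected by chance (kappa = (p_o - p_e) / (1 - p_e)). The sketch below shows how a single MHA value of this kind could be computed for one assessment task; it is a minimal illustration assuming scikit-learn is available, and the score vectors are hypothetical placeholders, not data from the study.

```python
# Minimal sketch (hypothetical data, not from the paper): computing
# Cohen's kappa for machine-human score agreement on one assessment task.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]    # scores assigned by a human rater
machine_scores = [2, 1, 0, 2, 1, 1, 0, 1, 2, 1]  # scores assigned by an ML model

# cohen_kappa_score returns chance-corrected agreement in [-1, 1];
# values near 1 indicate strong machine-human score agreement (MHA).
kappa = cohen_kappa_score(human_scores, machine_scores)
print(f"Cohen's kappa = {kappa:.2f}")
```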
Pages: 361-379
Number of pages: 19