The use of test scores from large-scale assessment surveys: psychometric and statistical considerations

Cited by: 20
Authors
Braun H. [1 ]
von Davier M. [2 ]
Affiliations
[1] Lynch School of Education, Campion Hall, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467
[2] National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104
Keywords
Conditioning model; Imputation; IRT; Large-scale assessment; Plausible values; Unbiasedness
DOI
10.1186/s40536-017-0050-x
Abstract
Background: Economists are making increasing use of measures of student achievement obtained through large-scale survey assessments such as NAEP, TIMSS, and PISA. The construction of these measures, which employs plausible value (PV) methodology, is quite different from that of the more familiar test scores associated with assessments such as the SAT or ACT. These differences have important implications for both utilization and interpretation. Although much has been written about PVs, misconceptions persist about whether and how to employ them in secondary analyses.
Methods: We address a range of technical issues, including those raised in a recent article written to inform economists using these databases. First, an extensive review of the relevant literature was conducted, with particular attention to key publications that describe the derivation and psychometric characteristics of such achievement measures. Second, a simulation study was carried out to compare the statistical properties of estimates based on PVs with those based on other, commonly used methods.
Results: Both theoretical analysis and simulation show that, under fairly general conditions, appropriate use of PVs yields approximately unbiased estimates of model parameters in regression analyses of large-scale survey data. The superiority of the PV methodology is particularly evident when measures of student achievement are employed as explanatory variables.
Conclusions: The PV methodology used to report student test performance in large-scale surveys remains the state of the art for secondary analyses of these databases. © 2017, The Author(s).
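As a concrete illustration of the workflow the abstract describes, the Python sketch below shows the mechanics of a PV-based secondary analysis: the regression model is estimated once per plausible value, and the results are pooled with Rubin's combining rules. All data and variable names here are invented for illustration; the simulated draws are simple stand-ins for plausible values, and a real analysis would use PVs generated from the survey's conditioning model (which is what delivers the unbiasedness the paper discusses) together with sampling weights and replicate-weight variance estimation, all of which this sketch omits.

```python
import numpy as np

# Minimal sketch of a plausible-value (PV) secondary analysis.
# Hypothetical data: real PVs are posterior draws from a conditioning
# model that includes background variables; these are simple stand-ins.

rng = np.random.default_rng(0)
n, M = 2000, 5                     # students, plausible values per student

theta = rng.normal(0.0, 1.0, n)                    # latent achievement
outcome = 0.4 * theta + rng.normal(0.0, 1.0, n)    # secondary-analysis outcome
pvs = theta[:, None] + rng.normal(0.0, 0.5, (n, M))  # stand-in PV draws

def ols_slope(y, x):
    """OLS slope of y on x, with its estimated sampling variance."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], cov[1, 1]

# Step 1: estimate the model once per plausible value, with achievement
# (the PV) as the explanatory variable.
results = [ols_slope(outcome, pvs[:, m]) for m in range(M)]
est = np.array([b for b, _ in results])
var = np.array([v for _, v in results])

# Step 2: pool across the M analyses with Rubin's combining rules.
theta_bar = est.mean()              # combined point estimate
U = var.mean()                      # average within-imputation variance
B = est.var(ddof=1)                 # between-imputation variance
T = U + (1 + 1 / M) * B             # total variance of the pooled estimate

print(f"pooled slope = {theta_bar:.3f}, SE = {T ** 0.5:.3f}")
```

The key design point is that each plausible value is analyzed as if it were the true score, and the between-imputation variance B adds back the measurement uncertainty that a single point estimate (e.g., an EAP score) would understate.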