Clustering Examples in Multi-Dataset NLP Benchmarks with Item Response Theory

Cited by: 0
Authors
Rodriguez, Pedro
Htut, Phu Mon [1]
Lalor, John P. [2]
Sedoc, Joao [1]
Affiliations
[1] NYU, New York, NY 10003 USA
[2] Univ Notre Dame, Notre Dame, IN 46556 USA
Source
PROCEEDINGS OF THE THIRD WORKSHOP ON INSIGHTS FROM NEGATIVE RESULTS IN NLP (INSIGHTS 2022) | 2022
DOI
Not available
Chinese Library Classification
F [Economics]
Discipline classification code
02
Abstract
In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). It therefore seems sensible that methods like IRT should be able to detect differences between the datasets in a task. This work shows that current IRT models are not as good at identifying such differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
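As a rough illustration of what "jointly infers example and model properties" from model-example scores can mean, the sketch below fits the simplest IRT variant, a one-parameter logistic (1PL, Rasch) model, in which the probability that model j answers example i correctly is sigmoid(ability_j - difficulty_i). This record does not say which IRT variants the paper uses, so the 1PL choice, the toy response matrix, the function name `fit_1pl`, and the hyperparameters are all assumptions made for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_1pl(responses, steps=2000, lr=0.05, l2=0.01):
    """Fit a 1PL (Rasch) IRT model by gradient ascent.

    responses[j][i] is 1 if model j answered example i correctly, else 0.
    Maximizes the Bernoulli log-likelihood with a small L2 penalty
    (assumed here to keep estimates finite on tiny data).
    Returns (ability per model, difficulty per example).
    """
    n_models = len(responses)
    n_examples = len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_examples
    for _ in range(steps):
        # Start each gradient from the L2 penalty term.
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for j in range(n_models):
            for i in range(n_examples):
                # Residual between observed score and predicted probability.
                resid = responses[j][i] - sigmoid(ability[j] - difficulty[i])
                grad_a[j] += resid  # a correct answer raises ability
                grad_d[i] -= resid  # a correct answer lowers difficulty
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Hypothetical scores: 3 models (rows) on 4 examples (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
ability, difficulty = fit_1pl(responses)
print("ability:", [round(a, 2) for a in ability])
print("difficulty:", [round(d, 2) for d in difficulty])
```

In this toy setup the estimated difficulties simply track how many models solved each example. The abstract's question is whether such inferred parameters also separate the datasets that make up a benchmark, which this sketch does not attempt to test.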
Pages: 100-112
Page count: 13